TITLE OF THE INVENTION

PROCESSOR EXECUTING SIMD INSTRUCTIONS 

BACKGROUND OF THE INVENTION 
(1) Field of the Invention

The present invention relates to a processor such as a DSP or a
CPU, and more particularly to a processor suitable for performing
signal processing for sounds, images and others.

(2) Description of the Related Art

With the development in multimedia technologies, processors
are increasingly required to be capable of high-speed media
processing represented by sound and image signal processing. As
existing processors responding to such a requirement, there exist
Pentium (R) / Pentium (R) III / Pentium 4 (R) MMX/SSE/SSE2 and
others produced by the Intel Corporation of the United States
supporting SIMD (Single Instruction Multiple Data) instructions. Of
them, MMX, for example, is capable of performing the same
operation in one instruction on a maximum of eight integers stored
in a 64-bit MMX register.

However, there is a problem that such existing processors do 
not fully satisfy a wide range of requirements concerning media 
processing. 

For example, although capable of operating on multiple data
elements in a single instruction and comparing multiple data
elements in a single instruction, the existing processors cannot
evaluate the results of such comparisons in a single instruction.
For example, an existing processor is capable of comparing two data
elements stored in 32-bit registers on a byte-by-byte basis, and
setting the comparison results to four flags. However, it cannot
judge in one instruction whether or not the values of these four
flags are all zero. For this reason, the processor needs to read out
all four flags and execute more than one instruction to judge
whether or not all such values are zero. When four pixel values are
used as a unit of comparison, this requires a plurality of
instructions for evaluating results every time a comparison is made
against another set of pixel values, resulting in an increased
number of instructions and therefore a decreased speed of image
processing.
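The multi-step evaluation described above can be illustrated with a minimal C sketch. The function names compare_bytes and all_flags_zero are hypothetical, and the convention that a flag is 1 when the corresponding bytes differ is an assumption made only for illustration; the specification does not state the flag polarity of existing processors.

```c
#include <stdint.h>

/* Compare two 32-bit words byte by byte.
 * Assumption for illustration: flag i is set to 1 when byte i differs. */
unsigned compare_bytes(uint32_t a, uint32_t b)
{
    unsigned flags = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t x = (uint8_t)(a >> (8 * i));
        uint8_t y = (uint8_t)(b >> (8 * i));
        if (x != y)
            flags |= 1u << i;
    }
    return flags;
}

/* Evaluate the four flags one at a time, as a processor without a
 * dedicated instruction would have to do with several instructions. */
int all_flags_zero(unsigned flags)
{
    for (int i = 0; i < 4; i++) {
        if ((flags >> i) & 1u)
            return 0;
    }
    return 1;
}
```

In this sketch the loop in all_flags_zero stands in for the sequence of instructions that the single characteristic instruction described in the summary is intended to replace.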

SUMMARY OF THE INVENTION 

The present invention has been conceived in view of the
above problem, and it is an object of this invention to provide a
processor capable of executing sophisticated SIMD operations and a
processor capable of high-speed digital signal processing suited for
multimedia purposes.

As is obvious from the above explanation, the processor
according to the present invention is capable of executing a
characteristic SIMD instruction for judging whether or not results of
operations performed under a SIMD compare instruction are all zero
and setting such results to condition flags. This allows a faster
extraction of results of SIMD compare instructions (especially,
agreement/disagreement of results), as well as faster comparison
processing performed on more than one pixel value as a
processing unit and a faster detection of the EOF (End Of File) of a
file.

Moreover, the processor according to the present invention is
capable of executing a characteristic instruction for storing, into a
memory and the like, two pieces of byte data stored in one register
(byte data stored in the higher 16 bits and byte data stored in the
lower 16 bits). This eliminates the need for data type conversions
when byte data is handled in 16-bit SIMD, making processing
faster.

Furthermore, the processor according to the present
invention is capable of executing a characteristic instruction for
storing an immediate value into the higher 16 bits of a register
without changing the lower 16 bits of the register. This instruction,
when combined with Instruction "mov Rb, I16", makes it possible
for a 32-bit immediate value to be set in a register.
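How the two instructions combine can be sketched in C as follows. The functions mov_i16, sethi and load_imm32 are hypothetical models, not the processor's actual definition; in particular, modelling "mov" as zero-extending its 16-bit immediate is an assumption, although the final result does not depend on it because the higher 16 bits are overwritten afterwards.

```c
#include <stdint.h>

/* Model of "mov Rb, I16".
 * Assumption: the 16-bit immediate is zero-extended to 32 bits. */
uint32_t mov_i16(uint16_t imm)
{
    return (uint32_t)imm;
}

/* Model of the characteristic instruction: write the immediate into the
 * higher 16 bits and keep the lower 16 bits of the register unchanged. */
uint32_t sethi(uint32_t reg, uint16_t imm)
{
    return ((uint32_t)imm << 16) | (reg & 0xFFFFu);
}

/* Build a full 32-bit immediate from the two 16-bit halves. */
uint32_t load_imm32(uint32_t value)
{
    uint32_t reg = mov_i16((uint16_t)(value & 0xFFFFu)); /* lower half */
    return sethi(reg, (uint16_t)(value >> 16));          /* higher half */
}
```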
Also, the processor according to the present invention is
capable of executing a characteristic instruction for switching the
objects to be added, depending on the value of a vector condition
flag. This makes it possible for a single program to support half-pel
motion compensation (motion compensation performed on a
per-half-pixel basis) regardless of whether pixels are integer pixels
or half pixels.

Moreover, the processor according to the present invention is
capable of executing a characteristic instruction for generating a
value depending on the sign (positive/negative) of the value held in
a register and whether a value held in a register is zero or not. This
makes inverse quantization faster in image processing, since 1 is
outputted when a certain value is positive, -1 when negative, and 0
when 0.
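The value generation described above corresponds to a sign function, which might be sketched in C as below. The helper names sign_of and sign_halfwords are hypothetical, and applying the function independently to the two 16-bit halves of a 32-bit register is an assumption made here for illustration only.

```c
#include <stdint.h>

/* Returns 1 for a positive value, -1 for a negative value, 0 for zero. */
int16_t sign_of(int16_t v)
{
    return (int16_t)((v > 0) - (v < 0));
}

/* Assumed SIMD form: apply sign_of to each 16-bit half of a register.
 * The casts to int16_t rely on the usual two's complement conversion. */
uint32_t sign_halfwords(uint32_t r)
{
    uint16_t lo = (uint16_t)sign_of((int16_t)(r & 0xFFFFu));
    uint16_t hi = (uint16_t)sign_of((int16_t)(r >> 16));
    return ((uint32_t)hi << 16) | lo;
}
```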

Furthermore, the processor according to the present
invention is capable of executing a characteristic instruction for
aligning word data and extracting different word data depending on
a vector condition flag. This instruction makes it possible for a
single program to support half-pel motion compensation (motion
compensation performed on a per-half-pixel basis) regardless of
whether pixels are integer pixels or half pixels.

Also, the processor according to the present invention is
capable of executing a characteristic instruction for adding two
values and further adding 1 when one of the two values is positive.
This realizes a faster rounding of an absolute value in image
processing.

Moreover, the processor according to the present invention is
capable of executing a characteristic instruction for moving values
held in two arbitrary registers to two consecutive registers. Since
values held in two independent registers are moved in one cycle
under this instruction, an effect of reducing the number of cycles in
a loop can be achieved. Also, this instruction, which does not
involve register renaming (destruction of a register value), is
effective when data is moved between loop generations (iterations).

Furthermore, the processor according to the present
invention is capable of executing a characteristic instruction for
performing branches and setting condition flags (predicates, here)
in a loop. This speeds up a loop executed by means of
PROLOG/EPILOG removal software pipelining.

Also, the processor according to the present invention is
capable of executing a characteristic instruction for determining a
sum of absolute value differences. This speeds up the
summing of absolute value differences in motion prediction as part
of image processing.
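A sum of absolute value differences can be sketched in C as follows. The function name sad4 and the choice of four unsigned 8-bit lanes packed in a 32-bit word are assumptions for illustration; the operand layout of the actual instruction is given in the detailed description, not here.

```c
#include <stdint.h>

/* Sum of absolute differences over the four unsigned bytes packed in
 * two 32-bit words (assumed lane layout, for illustration only). */
uint32_t sad4(uint32_t a, uint32_t b)
{
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (a >> (8 * i)) & 0xFFu;
        uint32_t y = (b >> (8 * i)) & 0xFFu;
        sum += (x > y) ? (x - y) : (y - x);
    }
    return sum;
}
```

In motion prediction, a value of this kind is typically accumulated over all pixels of a block to score a candidate motion vector, which is why performing it in one instruction shortens the inner loop.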

Moreover, the processor according to the present invention is
capable of executing a characteristic instruction for converting a
signed value into a saturated signed value at an arbitrary position
(digit). This facilitates programming since there is no need for
fixing the position where saturation is performed to a specific
position at the time of assembler programming.
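Saturation at an arbitrary position can be sketched in C as below. The function saturate_signed is hypothetical, and expressing the saturation position as a bit width in the range 1 to 32 is an assumption; how the actual instruction encodes the position in its operands is not stated in this summary.

```c
#include <stdint.h>

/* Clamp a signed 32-bit value to the range representable in `bits`
 * bits of two's complement (assumed encoding of the position). */
int32_t saturate_signed(int32_t value, unsigned bits)
{
    if (bits == 0 || bits >= 32)
        return value; /* nothing to clamp in this sketch */

    int32_t max = (int32_t)((1u << (bits - 1)) - 1u);
    int32_t min = -max - 1;

    if (value > max)
        return max;
    if (value < min)
        return min;
    return value;
}
```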

Furthermore, the processor according to the present
invention is capable of executing a characteristic instruction for
selecting one of the values held in two registers on a word-by-word
basis. This allows word data to be stored at an arbitrary position in
a register, and therefore speeds up repeated data reshuffling.
Moreover, this instruction has an effect of increasing the flexibility of
SIMD operations.

Also, the processor according to the present invention is
capable of executing a characteristic instruction for extending
results of a SIMD operation. This allows the processing for making
all data sizes the same by sign extension or zero extension to
be performed in one cycle after the SIMD operations are performed.

Moreover, the processor according to the present invention is
capable of executing a characteristic instruction for executing SIMD
operations specified by condition flags and the like. This makes it
possible for a single program to perform such dynamic processing as
one in which the types of operations to be performed are determined
depending on results of other processing.

As described above, the processor according to the present
invention is capable of performing sophisticated SIMD operations
and a wide range of digital signal processing required for
multimedia processing at a high speed, and is capable of being
employed as a core processor to be commonly used in mobile
phones, mobile AV devices, digital televisions, DVDs and others. It
is therefore extremely useful in the present age, in which the advent
of high-performance and cost-effective multimedia apparatuses is
desired.

Note that it is possible to embody the present invention not only
as a processor executing the above-mentioned characteristic
instructions, but also as an operation processing method intended
for a plurality of data elements and the like, and as a program
including such characteristic instructions. It should also be
understood that such a program can be distributed via a recording
medium including a CD-ROM and the like, as well as via a
transmission medium including the Internet and the like.

For further information about the technical background to this
application, Japanese patent application No. 2002-280077 filed
September 25, 2002, is incorporated herein by reference.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other objects, advantages and features of the
invention will become apparent from the following description
thereof taken in conjunction with the accompanying drawings that
illustrate a specific embodiment of the invention. In the drawings:

Fig. 1 is a schematic block diagram showing a processor
according to the present invention.

Fig. 2 is a schematic diagram showing arithmetic and
logic/comparison operation units of the processor.

Fig. 3 is a block diagram showing a configuration of a barrel
shifter of the processor.

Fig. 4 is a block diagram showing a configuration of a
converter of the processor.

Fig. 5 is a block diagram showing a configuration of a divider
of the processor.

Fig. 6 is a block diagram showing a configuration of a
multiplication/sum of products operation unit of the processor.

Fig. 7 is a block diagram showing a configuration of an
instruction control unit of the processor.

Fig. 8 is a diagram showing a configuration of general-purpose
registers (R0~R31) of the processor.

Fig. 9 is a diagram showing a configuration of a link register
(LR) of the processor.

Fig. 10 is a diagram showing a configuration of a branch
register (TAR) of the processor.

Fig. 11 is a diagram showing a configuration of a program
status register (PSR) of the processor.

Fig. 12 is a diagram showing a configuration of a condition flag
register (CFR) of the processor.

Figs. 13A and 13B are diagrams showing configurations of
accumulators (M0, M1) of the processor.

Fig. 14 is a diagram showing a configuration of a program
counter (PC) of the processor.

Fig. 15 is a diagram showing a configuration of a PC save
register (IPC) of the processor.

Fig. 16 is a diagram showing a configuration of a PSR save
register (IPSR) of the processor.

Fig. 17 is a timing diagram showing a pipeline behavior of the
processor.

Fig. 18 is a timing diagram showing each stage of the pipeline
behavior of the processor at the time of executing an instruction.

Fig. 19 is a diagram showing a parallel behavior of the
processor.

Fig. 20 is a diagram showing the format of instructions executed
by the processor.

Fig. 21 is a diagram explaining an instruction belonging to a
category "ALUadd (addition) system".

Fig. 22 is a diagram explaining an instruction belonging to a
category "ALUsub (subtraction) system".

Fig. 23 is a diagram explaining an instruction belonging to a
category "ALUlogic (logical operation) system and others".

Fig. 24 is a diagram explaining an instruction belonging to a
category "CMP (comparison operation) system".

Fig. 25 is a diagram explaining an instruction belonging to a
category "mul (multiplication) system".

Fig. 26 is a diagram explaining an instruction belonging to a
category "mac (sum of products operation) system".

Fig. 27 is a diagram explaining an instruction belonging to a
category "msu (difference of products) system".

Fig. 28 is a diagram explaining an instruction belonging to a
category "MEMld (load from memory) system".

Fig. 29 is a diagram explaining an instruction belonging to a
category "MEMstore (store in memory) system".

Fig. 30 is a diagram explaining an instruction belonging to a
category "BRA (branch) system".

Fig. 31 is a diagram explaining an instruction belonging to a
category "BSasl (arithmetic barrel shift) system and others".

Fig. 32 is a diagram explaining an instruction belonging to a
category "BSlsr (logical barrel shift) system and others".

Fig. 33 is a diagram explaining an instruction belonging to a
category "CNVvaln (arithmetic conversion) system".

Fig. 34 is a diagram explaining an instruction belonging to a
category "CNV (general conversion) system".

Fig. 35 is a diagram explaining an instruction belonging to a
category "SATvlpk (saturation processing) system".

Fig. 36 is a diagram explaining an instruction belonging to a
category "ETC (et cetera) system".

Fig. 37 is a diagram showing a behavior of the processor when
executing Instruction "vcchk".

Fig. 38 is a diagram showing a detailed behavior when
executing Instruction "vcchk".

Fig. 39 is a diagram showing a behavior of the processor when
executing Instruction "stbh (Ra),Rb".

Fig. 40 is a diagram showing a detailed behavior when
executing Instruction "stbh (Ra),Rb".

Fig. 41 is a diagram showing a behavior of the processor when
executing Instruction "stbhp (Ra),Rb:Rb+1".

Fig. 42 is a diagram showing a detailed behavior when
executing Instruction "stbhp (Ra),Rb:Rb+1".

Fig. 43 is a diagram showing a behavior of the processor when
executing Instruction "sethi Ra,I16".

Fig. 44 is a diagram showing a detailed behavior when
executing Instruction "sethi Ra,I16".

Fig. 45 is a diagram showing a behavior of the processor when
executing Instruction "vaddhvc Rc,Ra,Rb".

Fig. 46 is a diagram showing a detailed behavior when
executing Instruction "vaddhvc Rc,Ra,Rb".

Fig. 47 is a diagram explaining motion estimation in image
processing.

Fig. 48 is a diagram showing a behavior of the processor when
executing Instruction "vaddrhvc Rc,Ra,Rb".

Fig. 49 is a diagram showing a detailed behavior when
executing Instruction "vaddrhvc Rc,Ra,Rb".
Fig. 50 is a diagram showing a behavior of the processor when
executing Instruction "vsgnh Ra,Rb".

Fig. 51 is a diagram showing a detailed behavior when
executing Instruction "vsgnh Ra,Rb".

Fig. 52 is a diagram showing a behavior of the processor when
executing Instruction "valnvc1 Rc,Ra,Rb".

Fig. 53 is a diagram showing a detailed behavior when
executing Instruction "valnvc1 Rc,Ra,Rb".

Fig. 54 is a diagram showing a detailed behavior when
executing Instruction "valnvc2 Rc,Ra,Rb".

Fig. 55 is a diagram showing a detailed behavior when
executing Instruction "valnvc3 Rc,Ra,Rb".

Fig. 56 is a diagram showing a detailed behavior when
executing Instruction "valnvc4 Rc,Ra,Rb".

Fig. 57 is a diagram showing a behavior of the processor when
executing Instruction "addarvw Rc,Rb,Ra".

Fig. 58 is a diagram showing a detailed behavior when
executing Instruction "addarvw Rc,Rb,Ra".

Fig. 59 is a diagram showing a behavior when performing
"rounding of absolute values (away from zero)".

Fig. 60 is a diagram showing a behavior of the processor when
executing Instruction "movp Rc:Rc+1,Ra,Rb".

Fig. 61 is a diagram showing a detailed behavior when
executing Instruction "movp Rc:Rc+1,Ra,Rb".

Fig. 62 is a diagram showing a detailed behavior when
executing Instruction "jloop C6,Cm,TAR,Ra".

Fig. 63 is a diagram showing a detailed behavior when
executing Instruction "settar C6,Cm,D9".

Fig. 64 is a diagram showing PROLOG/EPILOG removal
2-stage software pipelining.

Fig. 65 is a diagram showing a list of a source program written
in the C language.

Fig. 66 is a diagram showing an example machine language
program created using ordinary instructions "jloop" and "settar".

Fig. 67 is a diagram showing an example machine language
program created using Instructions "jloop" and "settar" according to
the preferred embodiment of the present invention.

Fig. 68 is a diagram showing a detailed behavior when
executing Instruction "jloop C6,C2:C4,TAR,Ra".

Fig. 69 is a diagram showing a detailed behavior when
executing Instruction "settar C6,C2:C4,D9".

Figs. 70A and 70B are diagrams showing PROLOG/EPILOG
removal 3-stage software pipelining.

Fig. 71 is a diagram showing a list of a source program written
in the C language.

Fig. 72 is a diagram showing an example machine language
program created using ordinary instructions "jloop" and "settar".

Fig. 73 is a diagram showing an example machine language
program created using Instructions "jloop" and "settar" according to
the preferred embodiment of the present invention.

Fig. 74 is a diagram showing a behavior of the processor when
executing Instruction "vsada Rc,Ra,Rb,Rx".

Fig. 75A is a diagram showing Instruction "vsada Rc,Ra,Rb,Rx",
and Fig. 75B is a diagram showing Instruction "vsada Rc,Ra,Rb".

Fig. 76 is a diagram showing a behavior of the processor when
executing Instruction "satss Rc,Ra,Rb".

Fig. 77A is a diagram showing Instruction "satss Rc,Ra,Rb"
and Fig. 77B is a diagram showing Instruction "satsu Rc,Ra,Rb".

Fig. 78 is a diagram showing a behavior of the processor when
executing Instruction "bytesel Rc,Ra,Rb,Rx".

Fig. 79A is a diagram showing a detailed behavior when
executing Instruction "bytesel Rc,Ra,Rb,Rx", Fig. 79B is a diagram
showing a relationship between the register Rx and byte data to be
selected, Fig. 79C is a diagram showing a detailed behavior when
executing Instruction "bytesel Rc,Ra,Rb,I12", and Fig. 79D is a
diagram showing a relationship between an immediate value I12
and byte data to be selected.

Figs. 80A and 80B are diagrams showing bit extension
(sign extension or zero extension) being performed on a part of
SIMD operation results.

Fig. 81 is a diagram showing bit extension being performed on
all of SIMD operation results.

Fig. 82 is a diagram showing a SIMD operation specified by
condition flags and the like being performed.


DESCRIPTION OF THE PREFERRED EMBODIMENT 

An explanation is given for the architecture of the processor
according to the present invention. The processor of the present
invention is a general-purpose processor which has been developed
targeting the field of AV media signal processing technology, and
instructions issued in this processor offer a higher degree of
parallelism than ordinary microcomputers. Used as a core common
to mobile phones, mobile AV devices, digital televisions, DVDs and
others, the processor can improve software usability. Furthermore,
the present processor allows multiple high-performance media
processes to be performed with high cost effectiveness, and
provides a development environment for high-level languages
intended for improving development efficiency.

Fig. 1 is a schematic block diagram showing the present
processor. The processor 1 is comprised of an instruction control
unit 10, a decoding unit 20, a register file 30, an operation unit 40,
an I/F unit 50, an instruction memory unit 60, a data memory unit
70, an extended register unit 80, and an I/O interface unit 90. The
operation unit 40 includes arithmetic and logic/comparison
operation units 41~43, a multiplication/sum of products operation
unit 44, a barrel shifter 45, a divider 46, and a converter 47 for
performing SIMD instructions. The multiplication/sum of products
operation unit 44 is capable of handling a maximum of 65-bit
accumulation so as not to decrease bit precision. The
multiplication/sum of products operation unit 44 is also capable of
executing SIMD instructions as in the case of the arithmetic and
logic/comparison operation units 41~43. Furthermore, the
processor 1 is capable of parallel execution of an arithmetic and
logic/comparison operation instruction on a maximum of three data
elements.

Fig. 2 is a schematic diagram showing the arithmetic and
logic/comparison operation units 41~43. Each of the arithmetic
and logic/comparison operation units 41~43 is made up of an ALU
unit 41a, a saturation processing unit 41b, and a flag unit 41c. The
ALU unit 41a includes an arithmetic operation unit, a logical
operation unit, a comparator, and a TST. The bit widths of
operation data to be supported are 8 bits (use four operation units in
parallel), 16 bits (use two operation units in parallel) and 32 bits
(process 32-bit data using all operation units). For a result of an
arithmetic operation, the flag unit 41c and the like detects an
overflow and generates a condition flag. For a result of each of the
operation units, the comparator and the TST, an arithmetic shift
right, saturation by the saturation processing unit 41b, the
detection of maximum/minimum values, and absolute value
generation processing are performed.
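The effect of operating on 8-bit or 16-bit data in parallel within one 32-bit word can be illustrated by the following C sketch. The functions add_bytes and add_halfwords are hypothetical, and the wraparound (non-saturating) behavior is an assumption chosen to keep the sketch short; the point shown is only that a carry does not propagate from one lane into the next.

```c
#include <stdint.h>

/* Four independent 8-bit additions packed in one 32-bit word.
 * Each lane wraps around on overflow (assumed, for illustration). */
uint32_t add_bytes(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (a >> (8 * i)) & 0xFFu;
        uint32_t y = (b >> (8 * i)) & 0xFFu;
        result |= ((x + y) & 0xFFu) << (8 * i);
    }
    return result;
}

/* Two independent 16-bit additions packed in one 32-bit word. */
uint32_t add_halfwords(uint32_t a, uint32_t b)
{
    uint32_t lo = ((a & 0xFFFFu) + (b & 0xFFFFu)) & 0xFFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}
```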

Fig. 3 is a block diagram showing the configuration of the
barrel shifter 45. The barrel shifter 45, which is made up of
selectors 45a and 45b, a higher bit shifter 45c, a lower bit shifter
45d, and a saturation processing unit 45e, executes an arithmetic
shift of data (shift in the 2's complement number system) or a
logical shift of data (unsigned shift). Usually, 32-bit or 64-bit data
are inputted to and outputted from the barrel shifter 45. The
amount of shift of target data stored in the registers 30a and 30b is
specified by another register or by an immediate value.
An arithmetic or logical shift in the range of 63 bits to the left and
63 bits to the right is performed on the data, which is then
outputted in the input bit length.

The barrel shifter 45 is capable of shifting 8-, 16-, 32-, and
64-bit data in response to a SIMD instruction. For example, the
barrel shifter 45 can shift four pieces of 8-bit data in parallel.

Arithmetic shift, which is a shift in the 2's complement
number system, is performed for aligning decimal points at the time
of addition and subtraction, for multiplying by a power of 2 (2, the
2nd power of 2, the -1st power of 2) and for other purposes.
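The difference between an arithmetic shift and a logical shift to the right can be sketched in C as follows. The functions lsr32 and asr32 are hypothetical 32-bit models and do not reproduce the 64-bit path or the saturation processing unit 45e of the barrel shifter 45; the final cast back to a signed type assumes the usual two's complement conversion.

```c
#include <stdint.h>

/* Logical shift right: vacated higher bits are filled with 0. */
uint32_t lsr32(uint32_t v, unsigned n)
{
    return (n >= 32) ? 0u : (v >> n);
}

/* Arithmetic shift right: vacated higher bits are filled with copies
 * of the sign bit, so the result is the value divided by 2^n and
 * rounded toward negative infinity. */
int32_t asr32(int32_t v, unsigned n)
{
    if (n >= 32)
        n = 31;

    uint32_t shifted = (uint32_t)v >> n;
    if (v < 0 && n > 0)
        shifted |= ~(UINT32_MAX >> n); /* replicate the sign bit */

    return (int32_t)shifted;
}
```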

Fig. 4 is a block diagram showing the configuration of the 
converter 47. The converter 47 is made up of a saturation block 
(SAT) 47a, a BSEQ block 47b, an MSKGEN block 47c, a VSUMB block 
47d, a BCNT block 47e, and an IL block 47f. 
The saturation block (SAT) 47a performs saturation
processing for input data. Having two blocks for the saturation
processing of 32-bit data makes it possible to support a SIMD
instruction executed for two data elements in parallel.

The BSEQ block 47b counts consecutive 0s or 1s from the
MSB.

The MSKGEN block 47c outputs a specified bit segment as 1,
while outputting the others as 0.

The VSUMB block 47d divides the input data into specified bit
widths, and outputs their total sum.

The BCNT block 47e counts the number of bits in the input
data specified as 1.

The IL block 47f divides the input data into specified bit
widths, and outputs a value resulting from exchanging the position of
each data block.
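The functions of several of these blocks can be sketched in C as follows. The names bseq, mskgen, vsumb and bcnt simply mirror the block names; the 32-bit input width, the inclusive bit-range arguments of mskgen, and the fixed 8-bit width used by vsumb are assumptions made for illustration and are not taken from the specification.

```c
#include <stdint.h>

/* BSEQ: count the run of identical bits (0s or 1s) starting at the MSB. */
unsigned bseq(uint32_t v)
{
    uint32_t msb = v >> 31;
    unsigned count = 0;
    for (int i = 31; i >= 0 && ((v >> i) & 1u) == msb; i--)
        count++;
    return count;
}

/* MSKGEN: output 1 in bits lo..hi (inclusive, 0 <= lo <= hi <= 31),
 * and 0 in all other bits. */
uint32_t mskgen(unsigned lo, unsigned hi)
{
    uint32_t upto_hi = (hi >= 31) ? UINT32_MAX : ((1u << (hi + 1)) - 1u);
    return upto_hi & ~((1u << lo) - 1u);
}

/* VSUMB: divide the input into 8-bit pieces and output their total. */
uint32_t vsumb(uint32_t v)
{
    return (v & 0xFFu) + ((v >> 8) & 0xFFu) +
           ((v >> 16) & 0xFFu) + (v >> 24);
}

/* BCNT: count the number of bits set to 1 in the input. */
unsigned bcnt(uint32_t v)
{
    unsigned count = 0;
    while (v != 0) {
        count += v & 1u;
        v >>= 1;
    }
    return count;
}
```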

Fig. 5 is a block diagram showing the configuration of the
divider 46. Letting a dividend be 64 bits and a divisor be 32 bits,
the divider 46 outputs 32 bits as a quotient and a modulo,
respectively. 34 cycles are involved for obtaining a quotient and a
modulo. The divider 46 can handle both signed and unsigned data.
Note, however, that an identical setting is made concerning the
presence/absence of signs of data serving as a dividend and a
divisor. Also, the divider 46 has the capability of outputting an
overflow flag and a 0 division flag.
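The input and output widths of the divider can be modelled in C as below for the unsigned case. The type div_result and the function divide_u64_u32 are hypothetical; treating a quotient that does not fit in 32 bits as the overflow condition, and returning zeroed results when a flag is raised, are assumptions for illustration. The 34-cycle timing and the signed case are not modelled.

```c
#include <stdint.h>

/* Outputs of the modelled divider (hypothetical structure). */
typedef struct {
    uint32_t quotient;
    uint32_t remainder;
    int overflow;     /* assumed: quotient does not fit in 32 bits */
    int div_by_zero;  /* divisor was 0 */
} div_result;

/* Unsigned division of a 64-bit dividend by a 32-bit divisor. */
div_result divide_u64_u32(uint64_t dividend, uint32_t divisor)
{
    div_result r = { 0u, 0u, 0, 0 };

    if (divisor == 0u) {
        r.div_by_zero = 1;
        return r;
    }

    uint64_t q = dividend / divisor;
    if (q > UINT32_MAX) {
        r.overflow = 1;
        return r;
    }

    r.quotient = (uint32_t)q;
    r.remainder = (uint32_t)(dividend % divisor);
    return r;
}
```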

Fig. 6 is a block diagram showing the configuration of the
multiplication/sum of products operation unit 44. The
multiplication/sum of products operation unit 44, which is made up
of two 32-bit multipliers (MUL) 44a and 44b, three 64-bit adders
(Adder) 44c~44e, a selector 44f and a saturation processing unit
(Saturation) 44g, performs the following multiplications and sums of
products:

■ 32x32-bit signed multiplication, sum of products, and
difference of products;

■ 32x32-bit unsigned multiplication;

■ 16x16-bit signed multiplication, sum of products, and
difference of products performed on two data elements in parallel;
and

■ 32x16-bit signed multiplication, sum of products, and
difference of products performed on two data elements in parallel.

The above operations are performed on data in integer and
fixed point format (h1, h2, w1, and w2). Also, the results of these
operations are rounded and saturated.

Fig. 7 is a block diagram showing the configuration of the
instruction control unit 10. The instruction control unit 10, which is
made up of an instruction cache 10a, an address management unit
10b, instruction buffers 10c~10e, a jump buffer 10f, and a rotation
unit (rotation) 10g, issues instructions at ordinary times and at
branch points. Having three 128-bit instruction buffers (the
instruction buffers 10c~10e) makes it possible to support the
maximum number of parallel instruction executions. Regarding
branch processing, the instruction control unit 10 stores in advance
a branch target instruction into the jump buffer 10f and stores a
branch target address into the below-described TAR register before
performing a branch (settar instruction). Thus, the instruction
control unit 10 performs the branch using the branch target address
stored in the TAR register and the branch target instruction stored in
the jump buffer 10f.

Note that the processor 1 is a processor employing the VLIW
architecture. The VLIW architecture is an architecture allowing a
plurality of instructions (e.g. load, store, operation, and branch) to
be stored in a single instruction word, and such instructions to be
executed all at once. When programmers describe a set of
instructions which can be executed in parallel as a single issue group,
such an issue group can be processed in parallel. In
this specification, the delimiter of an issue group is indicated by
";;". Notational examples are described below.
(Example 1)

mov r1, 0x23;;

This instruction description indicates that only an instruction
"mov" shall be executed.

(Example 2)

mov r1, 0x38
add r0, r1, r2
sub r3, r1, r2;;

These instruction descriptions indicate that three instructions
of "mov", "add" and "sub" shall be executed in parallel.

The instruction control unit 10 identifies an issue group and
sends it to the decoding unit 20. The decoding unit 20 decodes the
instructions in the issue group, and controls resources required for
executing such instructions.

Next, an explanation is given for registers included in the
processor 1.

Table 1 below lists a set of registers of the processor 1.
[Table 1]

Register name    Bit width  No. of registers  Usage
---------------  ---------  ----------------  ----------------------------------------
R0~R31           32 bits    32                General-purpose registers. Used as data
                                              memory pointer, data storage and the like
                                              when operation instruction is executed.
TAR              32 bits    1                 Branch register. Used as branch address
                                              storage at branch point.
LR               32 bits    1                 Link register.
SVR              16 bits    2                 Save register. Used for saving condition
                                              flag (CFR) and various modes.
M0~M1            64 bits    2                 Operation registers. Used as data storage
(MH0:ML0~                                     when operation instruction is executed.
MH1:ML1)

Table 2 below lists a set of flags (flags managed in a condition
flag register and the like described later) of the processor 1.

[Table 2]

Flag name  Bit width  No. of flags  Usage
---------  ---------  ------------  ------------------------------------------------
C0~C7      1          8             Condition flags. Indicate if condition is
                                    established or not.
VC0~VC3    1          4             Condition flags for media processing extension
                                    instruction. Indicate if condition is established
                                    or not.
OVS        1          1             Overflow flag. Detects overflow at the time of
                                    operation.
CAS        1          1             Carry flag. Detects carry at the time of
                                    operation.
BPO        5          1             Specifies bit position. Specifies bit positions
                                    to be processed when mask processing instruction
                                    is executed.
ALN        2          1             Specifies byte alignment.
FXP        1          1             Fixed point operation mode.
UDR        32         1             Undefined register.

Fig. 8 is a diagram showing the configuration of the
general-purpose registers (R0~R31) 30a. The general-purpose
registers (R0~R31) 30a are a group of 32-bit registers that
constitute an integral part of the context of a task to be executed
and that store data or addresses. Note that the general-purpose
registers R30 and R31 are used by hardware as a global pointer and
a stack pointer, respectively.

Fig. 9 is a diagram showing the configuration of a link register
(LR) 30c. In connection with this link register (LR) 30c, the
processor 1 also has a save register (SVR) not illustrated in the
diagram. The link register (LR) 30c is a 32-bit register for storing a
return address at the time of a function call. Note that the save
register (SVR) is a 16-bit register for saving a condition flag
(CFR.CF) of the condition flag register at the time of a function call.
The link register (LR) 30c is used also for the purpose of increasing
the speed of loops, as in the case of a branch register (TAR) to be
explained later. 0 is always read out as the lower 1 bit, and 0 must
be written to that bit at the time of writing.

For example, when "call (bri, jmpi)" instructions are executed,
the processor 1 saves a return address in the link register (LR) 30c
and saves a condition flag (CFR.CF) in the save register (SVR).
When "jmp" instruction is executed, the processor 1 fetches the
return address (branch target address) from the link register (LR)
30c, and restores a program counter (PC). Furthermore, when "ret
(jmpr)" instruction is executed, the processor 1 fetches the branch
target address (return address) from the link register (LR) 30c, and
stores (restores) it in/to the program counter (PC). Moreover, the
processor 1 fetches the condition flag from the save register (SVR)
so as to store (restore) it in/to a condition flag area CFR.CF in the
condition flag register (CFR) 32.
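The save and restore sequence on a function call and return can be sketched in C as follows. The structure cpu_state and the functions do_call and do_ret are hypothetical, and representing CFR.CF as an 8-bit field is an assumption made for illustration; only the data movement between PC, LR, SVR and CFR.CF described above is modelled.

```c
#include <stdint.h>

/* Minimal register model (hypothetical; field widths other than
 * LR = 32 bits and SVR = 16 bits are assumptions). */
typedef struct {
    uint32_t pc;   /* program counter */
    uint32_t lr;   /* link register (LR) */
    uint16_t svr;  /* save register (SVR) */
    uint8_t  cf;   /* condition flag area CFR.CF */
} cpu_state;

/* Call: save the return address in LR and CFR.CF in SVR, then branch. */
void do_call(cpu_state *s, uint32_t target, uint32_t return_addr)
{
    s->lr = return_addr;
    s->svr = s->cf;
    s->pc = target;
}

/* Return: restore PC from LR and CFR.CF from SVR. */
void do_ret(cpu_state *s)
{
    s->pc = s->lr;
    s->cf = (uint8_t)s->svr;
}
```

In this model the callee may freely change the condition flags, since the values in effect at the call are restored from SVR on return.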

Fig. 10 is a diagram showing the configuration of the branch
register (TAR) 30d. The branch register (TAR) 30d is a 32-bit
register for storing a branch target address, and used mainly for the
purpose of increasing the speed of loops. 0 is always read out as
the lower 1 bit, and 0 must be written to that bit at the time of
writing.

For example, when "jmp" and "jloop" instructions are 

25 executed, the processor 1 fetches a branch target address from the 
branch register (TAR) 30d, and stores it In the program counter (PC). 
When the Instruction indicated by the address stored in the branch 
register (TAR) 30d is stored in a branch instruction buffer, a branch 
penalty will be 0. An increased loop speed can be achieved by 

30 storing the top address of a loop in the branch register (TAR) 30d. 

Fig. 11 is a diagram showing the configuration of a program 
status register (PSR) 31. The program status register (PSR) 31, 
which constitutes an integral part of the context of a task to be 
executed, is a 32-bit register for storing the following processor 
status information: 

Bit SWE: indicates whether the switching of VMP (Virtual 
Multi-Processor) to LP (Logical Processor) is enabled or disabled. 
"0" indicates that switching to LP is disabled and "1" indicates that 
switching to LP is enabled. 

Bit FXP: indicates a fixed point mode. "0" indicates the mode 
0 and "1" indicates the mode 1. 

Bit IH: is an interrupt processing flag indicating whether 
maskable interrupt processing is ongoing. "1" indicates that 
there is ongoing interrupt processing and "0" indicates that there 
is no ongoing interrupt processing. This flag is automatically set on 
the occurrence of an interrupt. This flag is used to distinguish 
whether interrupt processing or program processing is taking place 
at the point in the program to which the processor returns 
in response to the "rti" instruction. 

Bit EH: is a flag indicating whether an error or an NMI is being 
processed. "0" indicates that error/NMI interrupt processing 
is not ongoing and "1" indicates that error/NMI interrupt processing 
is ongoing. This flag is masked if an asynchronous error or an NMI 
occurs when EH = 1. Meanwhile, when VMP is enabled, plate 
switching of VMP is masked. 

Bit PL [1:0]: indicates a privilege level. "00" indicates the 
privilege level 0, i.e., the processor abstraction level, "01" indicates 
the privilege level 1 (non-settable), "10" indicates the privilege level 
2, i.e., the system program level, and "11" indicates the privilege 
level 3, i.e., the user program level. 

Bit LPIE3: indicates whether LP-specific interrupt 3 is enabled 
or disabled. "1" indicates that an interrupt is enabled and "0" 
indicates that an interrupt is disabled. 

Bit LPIE2: indicates whether LP-specific interrupt 2 is enabled 
or disabled. "1" indicates that an interrupt is enabled and "0" 
indicates that an interrupt is disabled. 

Bit LPIE1: indicates whether LP-specific interrupt 1 is enabled 
or disabled. "1" indicates that an interrupt is enabled and "0" 
indicates that an interrupt is disabled. 

Bit LPIE0: indicates whether LP-specific interrupt 0 is enabled 
or disabled. "1" indicates that an interrupt is enabled and "0" 
indicates that an interrupt is disabled. 

Bit AEE: indicates whether a misalignment exception is 
enabled or disabled. "1" indicates that a misalignment exception is 
enabled and "0" indicates that a misalignment exception is disabled. 

Bit IE: indicates whether a level interrupt is enabled or 
disabled. "1" indicates that a level interrupt is enabled and "0" 
indicates that a level interrupt is disabled. 

Bit IM [7:0]: indicates an interrupt mask covering levels 0~7, 
each of which can be masked individually. Level 0 is the highest 
level. Of the interrupt requests which are not masked by any IMs, 
only the interrupt request with the highest level is accepted by the 
processor 1. When an interrupt request is accepted, levels below the 
accepted level are automatically masked by hardware. 
IM[0] denotes a mask of level 0, IM[1] a mask of level 1, IM[2] a 
mask of level 2, IM[3] a mask of level 3, IM[4] a mask of level 4, 
IM[5] a mask of level 5, IM[6] a mask of level 6, and IM[7] a mask 
of level 7. 

reserved: indicates a reserved bit. 0 is always read out. 0 
must be written at the time of writing. 
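
The acceptance rule described for Bit IM [7:0] can be sketched in C as follows. This is only an illustrative model, not the hardware logic: the function name accept_interrupt, the pending-request bitmap, and the polarity (a set IM bit meaning "masked") are assumptions that the text does not state.

```c
#include <stdint.h>

/*
 * Hypothetical model of PSR.IM[7:0].
 * Assumption: bit n of *im set to 1 means "level n is masked".
 * pending: bit n set means an interrupt request of level n is raised.
 * Returns the accepted level (0 is the highest), or -1 if none is accepted.
 */
int accept_interrupt(uint8_t pending, uint8_t *im)
{
    for (int level = 0; level < 8; level++) {
        uint8_t bit = (uint8_t)(1u << level);
        if ((pending & bit) && !(*im & bit)) {
            /* Levels below the accepted one (level+1..7) become masked. */
            *im |= (uint8_t)(0xFFu << (level + 1));
            return level;
        }
    }
    return -1;
}
```

With levels 1 and 2 pending and level 1 masked, this model accepts level 2 and then masks levels 3 through 7.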

Fig. 12 is a diagram showing the configuration of the condition 
flag register (CFR) 32. The condition flag register (CFR) 32, which 
constitutes an integral part of the context of a task to be executed, 
is a 32-bit register made up of condition flags, operation flags, 
vector condition flags, an operation instruction bit position 
specification field, and a SIMD data alignment information field. 

Bit ALN [1:0]: indicates an alignment mode. The alignment 
mode used by the "valnvc" instructions is set here. 

Bit BPO [4:0]: indicates a bit position. It is used in an 
instruction that requires a bit position specification. 

Bit VC0~VC3: are vector condition flags. Bytes or half words, 
taken in order from the LSB side through to the MSB side, 
correspond to the flags VC0 through to VC3. 

Bit OVS: is an overflow flag (summary). It is set on the 
detection of saturation and overflow. If neither is detected, the 
value held before the instruction is executed is retained. Clearing 
of this flag needs to be carried out by software. 

Bit CAS: is a carry flag (summary). It is set when a carry 
occurs under "addc" instruction, or when a borrow occurs under 
"subc" instruction. If there is no occurrence of a carry under "addc" 
instruction, or a borrow under "subc" instruction, the value held 
before the instruction is executed is retained. Clearing of this flag 
needs to be carried out by software. 

Bit C0~C7: are condition flags. The value of the flag C7 is 
always 1. A reflection of a FALSE condition (writing of 0) made to 
the flag C7 is ignored. 

reserved: indicates a reserved bit. 0 is always read out. 0 
must be written at the time of writing. 

Figs. 13A and 13B are diagrams showing the configurations of 
accumulators (M0, M1) 30b. Such accumulators (M0, M1) 30b, 
which constitute an integral part of the context of a task to be 
executed, are made up of a 32-bit register MH0-MH1 (register for 
multiply and divide/sum of products (the higher 32 bits)) shown in 
Fig. 13A and a 32-bit register ML0-ML1 (register for multiply and 
divide/sum of products (the lower 32 bits)) shown in Fig. 13B. 

The register MH0-MH1 is used for storing the higher 32 bits of 
operation results at the time of a multiply instruction, while used as 
the higher 32 bits of the accumulators at the time of a sum of 
products instruction. Moreover, the register MH0-MH1 can be used 
in combination with the general-purpose registers in the case where 
a bit stream is handled. Meanwhile, the register ML0-ML1 is used 
for storing the lower 32 bits of operation results at the time of a 
multiply instruction, while used as the lower 32 bits of the 
accumulators at the time of a sum of products instruction. 
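
As a rough illustration of how an MH/ML register pair holds a 64-bit result, the following C sketch models a signed multiply and a sum of products. The type accumulator and the functions acc_mul and acc_mac are hypothetical names, and the sketch assumes plain two's-complement wraparound; saturation and fixed point modes are not modelled.

```c
#include <stdint.h>

/* Hypothetical model of one accumulator: MHn (higher 32 bits), MLn (lower 32 bits). */
typedef struct {
    uint32_t mh;
    uint32_t ml;
} accumulator;

/* Multiply: the 64-bit signed product is split across MH and ML. */
void acc_mul(accumulator *m, int32_t a, int32_t b)
{
    uint64_t p = (uint64_t)((int64_t)a * (int64_t)b);
    m->mh = (uint32_t)(p >> 32);
    m->ml = (uint32_t)p;
}

/* Sum of products: MH:ML is treated as one 64-bit accumulator (wraparound assumed). */
void acc_mac(accumulator *m, int32_t a, int32_t b)
{
    uint64_t acc = ((uint64_t)m->mh << 32) | m->ml;
    acc += (uint64_t)((int64_t)a * (int64_t)b);
    m->mh = (uint32_t)(acc >> 32);
    m->ml = (uint32_t)acc;
}
```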

Fig. 14 is a diagram showing the configuration of a program 
counter (PC) 33. This program counter (PC) 33, which constitutes 
an integral part of the context of a task to be executed, is a 32-bit 
counter that holds the address of an instruction being executed. 

Fig. 15 is a diagram showing the configuration of a PC save 
register (IPC) 34. This PC save register (IPC) 34, which constitutes 
an integral part of the context of a task to be executed, is a 32-bit 
register. 

Fig. 16 is a diagram showing the configuration of a PSR save 
register (IPSR) 35. This PSR save register (IPSR) 35, which 
constitutes an integral part of the context of a task to be executed, 
is a 32-bit register for saving the program status register (PSR) 31. 
0 is always read out as a part corresponding to a reserved bit, and 0 
must be written to it at the time of writing. 

Next, an explanation is given for the memory space of the 
processor 1. In the processor 1, a linear memory space with a 
capacity of 4 GB is divided into 32 segments, and an instruction 
SRAM (Static RAM) and a data SRAM are allocated to 128-MB 
segments. With a 128-MB segment serving as one block, a target 
block to be accessed is set in a SAR (SRAM Area Register). A direct 
access is made to the instruction SRAM/data SRAM when the 
accessed address falls within a segment set in the SAR, but an access 
request is issued to a bus controller (BUC) when the address does 
not fall within a segment set in the SAR. An on chip memory (OCM), 
an external memory, an external device, an I/O port and others are 
connected to the BUC. Data can be read from and written to these 
devices. 
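
Since 4 GB divided into 32 segments gives 128 MB (2^27 bytes) per segment, the segment number is simply the upper 5 bits of a 32-bit address. The C sketch below shows that arithmetic. The helper names and the idea that the SAR holds a plain segment number are assumptions made for illustration; the actual SAR layout is not given in the text.

```c
#include <stdint.h>

#define SEGMENT_SHIFT 27u /* 128 MB = 2^27 bytes, so 32 segments cover 4 GB */

/* Segment number (0..31) that a 32-bit address falls in. */
unsigned segment_of(uint32_t addr)
{
    return addr >> SEGMENT_SHIFT;
}

/*
 * Returns 1 when the address lies in the segment named by sar_segment
 * (hypothetical SAR field), i.e. the SRAM would be accessed directly;
 * returns 0 when the request would go to the bus controller instead.
 */
int is_direct_sram_access(uint32_t addr, unsigned sar_segment)
{
    return segment_of(addr) == sar_segment;
}
```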

Fig. 17 is a timing diagram showing the pipeline behavior of 
the processor 1. As illustrated in the diagram, the pipeline of the 
processor 1 basically consists of the following five stages: 
instruction fetch; instruction assignment (dispatch); decode; 
execution; and writing. 

Fig. 18 is a timing diagram showing each stage of the pipeline 
behavior of the processor 1 at the time of executing an instruction. 
In the instruction fetch stage, an access is made to an instruction 
memory which is indicated by an address specified by the program 
counter (PC) 33, and the instruction is transferred to the instruction 
buffers 10c~10e and the like. In the instruction assignment stage, 
the output of branch target address information in response to a 
branch instruction, the output of an input register control signal, 
and the assignment of a variable length instruction are carried out, 
followed by the transfer of the instruction to an instruction register 
(IR). In the decode stage, the IR is inputted to the decoding unit 20, 
and an operation unit control signal and a memory access signal are 
outputted. In the execution stage, an operation is executed and 
the result of the operation is outputted either to the data memory or 
the general-purpose registers (R0~R31) 30a. In the writing stage, 
a value obtained as a result of data transfer, and the operation 
results are stored in the general-purpose registers. 

The VLIW architecture of the processor 1 allows parallel 
execution of the above processing on a maximum of three data 
elements. Therefore, the processor 1 performs the behavior shown 
in Fig. 18 in parallel at the timing shown in Fig. 19. 

Next, an explanation is given for a set of instructions 
executed by the processor 1 with the above configuration. 

Tables 3~5 list categorized instructions to be executed by the 
processor 1. 
[Table 3] 



Category                                Operation unit  Instruction operation code

Memory transfer instruction (load)      M               ld,ldh,ldhu,ldb,ldbu,ldp,ldhp,ldbp,ldbh,
                                                        ldbuh,ldbhp,ldbuhp
Memory transfer instruction (store)     M               st,sth,stb,stp,sthp,stbp,stbh,stbhp
Memory transfer instruction (others)    M               dpref,ldstb
External register transfer instruction  M               rd,rde,wt,wte
Branch instruction                      B               br,brl,call,jmp,jmpl,jmpr,ret,jmpf,jloop,
                                                        setbb,setlr,settar
Software interrupt instruction          B               rti,pi0,pi0l,pi1,pi1l,pi2,pi2l,pi3,pi3l,
                                                        pi4,pi4l,pi5,pi5l,pi6,pi6l,pi7,pi7l,
                                                        sc0,sc1,sc2,sc3,sc4,sc5,sc6,sc7
VMP/interrupt control instruction       B               intd,inte,vmpsleep,vmpsus,vmpswd,vmpswe,
                                                        vmpwait
Arithmetic operation instruction        A               abs,absvh,absvw,add,addarvw,addc,addmsk,
                                                        adds,addsr,addu,addvh,addvw,neg,negvh,
                                                        negvw,rsub,s1add,s2add,sub,subc,submsk,
                                                        subs,subvh,subvw,max,min
Logical operation instruction           A               and,andn,or,sethi,xor,not
Compare instruction                     A               cmpCC,cmpCCa,cmpCCn,cmpCCo,tstn,tstna,
                                                        tstnn,tstno,tstz,tstza,tstzn,tstzo
Move instruction                        A               mov,movcf,mvclcas,mvclovs,setlo,vcchk
NOP instruction                         A               nop
Shift instruction1                      S1              asl,aslvh,aslvw,asr,asrvh,asrvw,lsl,lsr,
                                                        rol,ror
Shift instruction2                      S2              aslp,aslpvw,asrp,asrpvw,lslp,lsrp
[Table 4] 



Category                                Operation unit  Instruction operation code

Extraction instruction                  S2              ext,extb,extbu,exth,exthu,extr,extru,extu
Mask instruction                        C               msk,mskgen
Saturation instruction                  C               sat12,sat9,satb,satbu,sath,satw
Conversion instruction                  C               valn,valn1,valn2,valn3,valnvc1,valnvc2,
                                                        valnvc3,valnvc4,vhpkb,vhpkh,vhunpkb,
                                                        vhunpkh,vintlhb,vintlhh,vintllb,vintllh,
                                                        vlpkb,vlpkbu,vlpkh,vlpkhu,vlunpkb,
                                                        vlunpkbu,vlunpkh,vlunpkhu,vstovb,vstovh,
                                                        vunpk1,vunpk2,vxchngh,vexth
Bit count instruction                   C               bcnt1,bseq,bseq0,bseq1
Others                                  C               [illegible],movp
Multiply instruction1                   X1              [illegible],lmul
[illegible]                             [illegible]     [illegible]
Sum of products instruction1            X1              [illegible],hmac,lmac
Sum of products instruction2            X2              fmacww,mac
Difference of products instruction1     X1              fmsuhh,fmsuhhr,fmsuhw,fmsuww,hmsu,lmsu
Difference of products instruction2     X2              fmsuww,msu
Divide instruction                      DIV             div,divu
Debugger instruction                    DBGM            dbgm0,dbgm1,dbgm2,dbgm3
[Table 5] 



Category                                Operation unit  Instruction operation code

SIMD arithmetic operation instruction   A               vabshvh,vaddb,vaddh,vaddhvc,vaddhvh,
                                                        vaddrhvc,vaddsb,vaddsh,vaddsrb,vaddsrh,
                                                        vasubb,vcchk,vhaddh,vhaddhvh,vhsubh,
                                                        vhsubhvh,vladdh,vladdhvh,vlsubh,vlsubhvh,
                                                        vnegb,vnegh,vneghvh,vsaddb,vsaddh,vsgnh,
                                                        vsrsubb,vsrsubh,vssubb,vssubh,vsubb,
                                                        vsubh,vsubhvh,vsubsh,vsumh,vsumh2,
                                                        vsumrh2,vxaddh,vxaddhvh,vxsubh,vxsubhvh,
                                                        vmaxb,vmaxh,vminb,vminh,vmovt,vsel
SIMD compare instruction                A               vcmpeqb,vcmpeqh,vcmpgeb,vcmpgeh,vcmpgtb,
                                                        vcmpgth,vcmpleb,vcmpleh,vcmpltb,vcmplth,
                                                        vcmpneb,vcmpneh,vscmpeqb,vscmpeqh,
                                                        vscmpgeb,vscmpgeh,vscmpgtb,vscmpgth,
                                                        vscmpleb,vscmpleh,vscmpltb,vscmplth,
                                                        vscmpneb,vscmpneh
SIMD shift instruction1                 S1              vaslb,vaslh,vaslvh,vasrb,vasrh,vasrvh,
                                                        vlslb,vlslh,vlsrb,vlsrh,vrolb,vrolh,
                                                        vrorb,vrorh
SIMD shift instruction2                 S2              vasl,vaslvw,vasr,vasrvw,vlsl,vlsr
SIMD saturation instruction             C               vsath,vsath12,vsath8,vsath8u,vsath9
[illegible] instruction                 C               [illegible]
SIMD multiply instruction               X2              vfmulh,vfmulhr,vfmulw,vhfmulh,vhfmulhr,
                                                        vhfmulw,vhmul,vlfmulh,vlfmulhr,vlfmulw,
                                                        vlmul,vmul,vpfmulhww,vxfmulh,vxfmulhr,
                                                        vxfmulw,vxmul
SIMD sum of products instruction        X2              vfmach,vfmachr,vfmacw,vhfmach,vhfmachr,
                                                        vhfmacw,vhmac,vlfmach,vlfmachr,vlfmacw,
                                                        vlmac,vmac,vpfmachww,vxfmach,vxfmachr,
                                                        vxfmacw,vxmac
SIMD difference of products instruction X2              vfmsuh,vfmsuw,vhfmsuh,vhfmsuw,vhmsu,
                                                        vlfmsuh,vlfmsuw,vlmsu,vmsu,vxfmsuh,
                                                        vxfmsuw,vxmsu
Note that "Operation units" in the above tables refer to 
operation units used in the respective instructions. More 
specifically, "A" denotes ALU instruction, "B" branch instruction, "C" 
conversion instruction, "DIV" divide instruction, "DBGM" debug 
instruction, "M" memory access instruction, "S1" and "S2" shift 
instruction, and "X1" and "X2" multiply instruction. 

Fig. 20 is a diagram showing the format of the instructions 
executed by the processor 1. 

The following describes what the acronyms stand for in the 
diagrams: "P" is predicate (execution condition: one of the eight 
condition flags C0~C7 is specified); "OP" is operation code field; "R" 
is register field; "I" is immediate field; and "D" is displacement field. 
Furthermore, predicates, which are flags for controlling whether or 
not an instruction is executed based on values of the condition flags 
C0~C7, serve as a technique that allows instructions to be 
selectively executed without using a branch instruction and 
that therefore accelerates processing. 
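
The effect of a predicate can be sketched in C as follows. This is a behavioral illustration only: the packing of C0~C7 into bits 0~7 of one byte and the names predicate_true and predicated_add are assumptions, not the register layout of Fig. 12. Because the flag C7 always reads as 1, selecting C7 makes the instruction unconditional.

```c
#include <stdint.h>

/*
 * cflags models the condition flags C0..C7 as bits 0..7 (hypothetical packing).
 * C7 always reads as 1, regardless of what the byte holds.
 */
int predicate_true(uint8_t cflags, unsigned p)
{
    return (p == 7u) ? 1 : (int)((cflags >> p) & 1u);
}

/*
 * A predicated add: Rc receives Ra + Rb only when the selected flag is 1;
 * otherwise the old value of Rc is kept and no branch is needed.
 */
int32_t predicated_add(uint8_t cflags, unsigned p,
                       int32_t rc, int32_t ra, int32_t rb)
{
    return predicate_true(cflags, p) ? ra + rb : rc;
}
```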

Figs. 21~36 are diagrams explaining outlined functionality of 
the instructions executed by the processor 1. More specifically, 
Fig. 21 explains an instruction belonging to the category "ALUadd 
(addition) system"; Fig. 22 explains an instruction belonging to the 
category "ALUsub (subtraction) system"; Fig. 23 explains an 
instruction belonging to the category "ALUlogic (logical operation) 
system and others"; Fig. 24 explains an instruction belonging to the 
category "CMP (comparison operation) system"; Fig. 25 explains an 
instruction belonging to the category "mul (multiplication) system"; 
Fig. 26 explains an instruction belonging to the category "mac (sum 
of products operation) system"; Fig. 27 explains an instruction 
belonging to the category "msu (difference of products) system"; 
Fig. 28 explains an instruction belonging to the category "MEMld 
(load from memory) system"; Fig. 29 explains an instruction 
belonging to the category "MEMstore (store in memory) system"; 
Fig. 30 explains an instruction belonging to the category "BRA 
(branch) system"; Fig. 31 explains an instruction belonging to the 
category "BSasl (arithmetic barrel shift) system and others"; Fig. 32 
explains an instruction belonging to the category "BSlsr (logical 
barrel shift) system and others"; Fig. 33 explains an instruction 
belonging to the category "CNVvaln (arithmetic conversion) 
system"; Fig. 34 explains an instruction belonging to the category 
"CNV (general conversion) system"; Fig. 35 explains an instruction 
belonging to the category "SATvlpk (saturation processing) 
system"; and Fig. 36 explains an instruction belonging to the 
category "ETC (et cetera) system". 

The following describes the meaning of each column in these 
diagrams: "SIMD" indicates the type of an instruction (distinction 
between SISD (SINGLE) and SIMD); "Size" indicates the size of 
each individual operand to be an operation target; "Instruction" 
indicates the operation code of an operation; "Operand" indicates the 
operands of an instruction; "CFR" indicates a change in the condition 
flag register; "PSR" indicates a change in the processor status 
register; "Typical behavior" indicates the overview of a behavior; 
"Operation unit" indicates an operation unit to be used; and "3116" 
indicates the size of an instruction. 

Figs. 37~748 are diagrams explaining the detailed 
functionality of the instructions executed by the processor 1. Note 
that the meaning of each symbol used for explaining the instructions 
is as described in Tables 6~10 below. 
[Table 6] 



Symbol        Meaning

X[i]          Bit number i of X
X[i:j]        Bit number j to bit number i of X
X:Y           Concatenated X and Y
{n{X}}        n repetitions of X
sextM(X,N)    Sign-extend X from N bit width to M bit width. Default of
              M is 32. Default of N is all possible bit widths of X.
uextM(X,N)    Zero-extend X from N bit width to M bit width. Default of
              M is 32. Default of N is all possible bit widths of X.
smul(X,Y)     Signed multiplication X * Y
umul(X,Y)     Unsigned multiplication X * Y
sdiv(X,Y)     Integer part in quotient of signed division X / Y
smod(X,Y)     Modulo with the same sign as dividend.
udiv(X,Y)     Quotient of unsigned division X / Y
umod(X,Y)     Modulo
abs(X)        Absolute value
bseq(X,Y)     for (i = 0; i < 32; i++) {
                  if (X[31-i] != Y) break;
              }
              result = i;
bcnt(X,Y)     S = 0;
              for (i = 0; i < 32; i++) {
                  if (X[i] == Y) S++;
              }
              result = S;
max(X,Y)      result = (X > Y)? X : Y;
min(X,Y)      result = (X < Y)? X : Y;
tstz(X,Y)     X & Y == 0
tstn(X,Y)     X & Y != 0
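
A few of the Table 6 notations translate directly into C. The sketch below is one possible rendering of sextM with M = 32, bseq and bcnt for 32-bit operands; the function names sext32, bseq and bcnt as C functions are illustrative, and Y is assumed to be a single bit value (0 or 1).

```c
#include <stdint.h>

/* sextM(X,N) with M = 32: sign-extend the low n bits of x (1 <= n <= 32). */
int32_t sext32(uint32_t x, unsigned n)
{
    uint32_t sign = 1u << (n - 1);
    uint32_t low = (n == 32u) ? x : (x & ((1u << n) - 1u));
    return (int32_t)((low ^ sign) - sign);
}

/* bseq(X,Y): number of consecutive bits equal to y, counted from the MSB. */
unsigned bseq(uint32_t x, unsigned y)
{
    unsigned i;
    for (i = 0; i < 32; i++) {
        if (((x >> (31 - i)) & 1u) != y) break;
    }
    return i;
}

/* bcnt(X,Y): number of bits of x that are equal to y. */
unsigned bcnt(uint32_t x, unsigned y)
{
    unsigned s = 0;
    for (unsigned i = 0; i < 32; i++) {
        if (((x >> i) & 1u) == y) s++;
    }
    return s;
}
```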
[Table 7] 



Symbol    Expansion         Meaning

Ra        Ra[31:0]          Register numbered a (0 <= a <= 31)
Ra+1      R(a+1)[31:0]      Register numbered a+1 (0 <= a <= 30)
Rb        Rb[31:0]          Register numbered b (0 <= b <= 31)
Rb+1      R(b+1)[31:0]      Register numbered b+1 (0 <= b <= 30)
Rc        Rc[31:0]          Register numbered c (0 <= c <= 31)
Rc+1      R(c+1)[31:0]      Register numbered c+1 (0 <= c <= 30)
Ra2       Ra2[31:0]         Register numbered a2 (0 <= a2 <= 15)
Ra2+1     R(a2+1)[31:0]     Register numbered a2+1 (0 <= a2 <= 14)
Rb2       Rb2[31:0]         Register numbered b2 (0 <= b2 <= 15)
Rb2+1     R(b2+1)[31:0]     Register numbered b2+1 (0 <= b2 <= 14)
Rc2       Rc2[31:0]         Register numbered c2 (0 <= c2 <= 15)
Rc2+1     R(c2+1)[31:0]     Register numbered c2+1 (0 <= c2 <= 14)
Ra3       Ra3[31:0]         Register numbered a3 (0 <= a3 <= 7)
Ra3+1     R(a3+1)[31:0]     Register numbered a3+1 (0 <= a3 <= 6)
Rb3       Rb3[31:0]         Register numbered b3 (0 <= b3 <= 7)
Rb3+1     R(b3+1)[31:0]     Register numbered b3+1 (0 <= b3 <= 6)
Rc3       Rc3[31:0]         Register numbered c3 (0 <= c3 <= 7)
Rc3+1     R(c3+1)[31:0]     Register numbered c3+1 (0 <= c3 <= 6)
Rx        Rx[31:0]          Register numbered x (0 <= x <= 3)
[Table 8] 



Symbol    Meaning

+         Addition
-         Subtraction
&         Logical AND
|         Logical OR
!         Logical NOT
<<        Logical shift left (arithmetic shift left)
>>        Arithmetic shift right
>>>       Logical shift right
^         Exclusive OR
~         Logical NOT
==        Equal
!=        Not equal
>         Greater than              Signed (regard left- and right-part MSBs as sign)
>=        Greater than or equal to  Signed (regard left- and right-part MSBs as sign)
>(u)      Greater than              Unsigned (not regard left- and right-part MSBs as sign)
>=(u)     Greater than or equal to  Unsigned (not regard left- and right-part MSBs as sign)
<         Less than                 Signed (regard left- and right-part MSBs as sign)
<=        Less than or equal to     Signed (regard left- and right-part MSBs as sign)
<(u)      Less than                 Unsigned (not regard left- and right-part MSBs as sign)
<=(u)     Less than or equal to     Unsigned (not regard left- and right-part MSBs as sign)
[Table 9] 



Symbol              Meaning

D(addr)             Double word data corresponding to address "addr" in Memory
W(addr)             Word data corresponding to address "addr" in Memory
H(addr)             Half data corresponding to address "addr" in Memory
B(addr)             Byte data corresponding to address "addr" in Memory
B(addr,bus_lock)    Access byte data corresponding to address "addr" in Memory,
                    and lock used bus concurrently (unlockable bus shall not be
                    locked)
B(addr,bus_unlock)  Access byte data corresponding to address "addr" in Memory,
                    and unlock used bus concurrently (unlock shall be ignored
                    for unlockable bus and bus which has not been locked)
EREG(num)           Extended register numbered "num"
EREG_ERR            To be 1 if error occurs when immediately previous access is
                    made to extended register. To be 0, when there was no error.
<-                  Write result
=>                  Synonym of instruction (translated by assembler)
reg#(Ra)            Register number of general-purpose register Ra (5-bit value)
0x                  Prefix of hexadecimal numbers
0b                  Prefix of binary numbers
tmp                 Temporary variable
UD                  Undefined value (value which is implementation-dependent or
                    which varies dynamically)
Dn                  Displacement value (n is a natural value indicating the
                    number of bits)
In                  Immediate value (n is a natural value indicating the number
                    of bits)
[Table 10] 



Symbol    Meaning

○ Explanation for syntax

if (condition) {
    Executed when condition is met;
} else {
    Executed when condition is not met;
}

Executed when condition A is met, if (condition A);
                                      * Not executed when condition A is not met

for (Expression1;Expression2;Expression3)   * Same as C language

(Expression1)? Expression2:Expression3      * Same as C language

○ Explanation for terms

The following explains terms used for explanations:

Integer multiplication
    Multiplication defined as "smul"

Fixed point multiplication
    Arithmetic shift left is performed after integer operation. When
    PSR.FXP is 0, the amount of shift is 1 bit, and when PSR.FXP is 1,
    2 bits.

SIMD operation straight / cross / high / low / pair
    Higher 16 bits and lower 16 bits of half word vector data are RH and
    RL, respectively. Operations performed on the Ra register and the Rb
    register are defined as follows:

    straight  Operation is performed between RHa and RHb, and RLa and RLb
    cross     Operation is performed between RHa and RLb, and RLa and RHb
    high      Operation is performed between RHa and RHb, and RLa and RHb
    low       Operation is performed between RHa and RLb, and RLa and RLb
    pair      Operation is performed between RH and RHb, and RH and RLb
              (RH is 32-bit data)
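
The operand pairings named straight, cross, high and low can be illustrated with a half word vector addition in C. This is a sketch under stated assumptions: the enum simd_mode and the function vadd_h are hypothetical names, the "straight" pairing is taken to be RHa with RHb and RLa with RLb, and results wrap around modulo 2^16 rather than saturating. The "pair" mode, which uses a 32-bit RH, is not modelled.

```c
#include <stdint.h>

typedef enum { STRAIGHT, CROSS, HIGH, LOW } simd_mode;

static uint16_t hi16(uint32_t r) { return (uint16_t)(r >> 16); }
static uint16_t lo16(uint32_t r) { return (uint16_t)r; }

/* Half word vector add of Ra and Rb with the operand pairing chosen by mode. */
uint32_t vadd_h(simd_mode mode, uint32_t ra, uint32_t rb)
{
    uint16_t bh, bl; /* Rb half words paired with RHa and RLa respectively */

    switch (mode) {
    case CROSS: bh = lo16(rb); bl = hi16(rb); break; /* RHa+RLb, RLa+RHb */
    case HIGH:  bh = hi16(rb); bl = hi16(rb); break; /* RHa+RHb, RLa+RHb */
    case LOW:   bh = lo16(rb); bl = lo16(rb); break; /* RHa+RLb, RLa+RLb */
    default:    bh = hi16(rb); bl = lo16(rb); break; /* RHa+RHb, RLa+RLb */
    }

    uint16_t rh = (uint16_t)(hi16(ra) + bh);
    uint16_t rl = (uint16_t)(lo16(ra) + bl);
    return ((uint32_t)rh << 16) | rl;
}
```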



[Instruction vcchk] 

Instruction vcchk is a SIMD instruction for judging whether 
results of a SIMD compare instruction (e.g. vcmpCCb) are all zero or 
not, and setting the results to the condition flag register (CFR) 32. 
For example, when 

vcchk 

is executed, the processor judges, as illustrated in Fig. 37, whether 
the vector condition flags VC0~VC3 (110) in the condition flag 
register (CFR) 32 are all zero or not, and sets the condition flags C4 
and C5 in the condition flag register (CFR) 32 to 1 and 0 respectively 
when all of the vector condition flags VC0~VC3 (110) are zero, while 
setting the condition flags C4 and C5 in the condition flag register 
(CFR) 32 to 0 and 1 respectively when not all the vector condition 
flags VC0~VC3 (110) are zero. Then, the vector condition flags 
VC0~VC3 are stored in the condition flags C0~C3. A detailed 
behavior is as shown in Fig. 38. 

This instruction allows a faster extraction of results of SIMD 
compare instructions (especially, agreement/disagreement of 
results), and is effective for purposes such as detecting the EOF 
(End Of File) of a file. 
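
The flag update described for Instruction vcchk can be modelled in C as below. The packing of C0~C7 and VC0~VC3 into the low bits of a byte is a hypothetical layout chosen for the sketch, and leaving C6 and C7 untouched is an assumption; only the C0~C5 behavior comes from the description above.

```c
#include <stdint.h>

/*
 * cflags: condition flags C0..C7 in bits 0..7 (hypothetical packing).
 * vc:     vector condition flags VC0..VC3 in bits 0..3.
 * Returns the new condition flags after vcchk.
 */
uint8_t vcchk(uint8_t cflags, uint8_t vc)
{
    uint8_t v = (uint8_t)(vc & 0x0Fu);
    unsigned all_zero = (v == 0u);
    uint8_t out = (uint8_t)(cflags & 0xC0u);   /* C6, C7 kept (assumption) */

    out |= v;                                  /* C0..C3 <- VC0..VC3 */
    out |= (uint8_t)(all_zero << 4);           /* C4 = 1 when all are zero */
    out |= (uint8_t)((all_zero ^ 1u) << 5);    /* C5 = 1 when not all are zero */
    return out;
}
```

A single predicated instruction on C4 or C5 can then act on the outcome of a whole SIMD comparison.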

[Instruction stbh, stbhp] 

Instruction stbh is an instruction for storing, into a memory 
and the like, two pieces of byte data stored in one register (byte 
data stored in the higher 16 bits and byte data stored in the lower 16 
bits). This instruction is paired with Instruction ldbh (for moving 
data in the opposite direction). For example, when 

stbh (Ra), Rb 

is executed, the processor 1, using the I/F unit 50 and others, stores 
two pieces of byte data stored in the register Rb (bits 16~23 and 
bits 0~7 of the register Rb) into storage locations indicated by 
addresses specified by the register Ra, as illustrated in Fig. 39. A 
detailed behavior is as shown in Fig. 40. 

Instruction stbhp is an instruction for storing, into a memory 
and the like, four pieces of byte data stored in two registers (pair 
registers) (two pieces of byte data stored in the higher 16 bits of the 
respective registers and two pieces of byte data stored in the lower 
16 bits of the respective registers). This instruction is paired with 
Instruction ldbhp (for moving data in the opposite direction). For 
example, when 

stbhp (Ra), Rb: Rb+1 

is executed, the processor 1, using the I/F unit 50 and others, stores 
four pieces of byte data stored in the registers Rb and Rb+1 (bits 
16~23 and bits 0~7 of the respective registers) into storage locations 
indicated by addresses specified by the register Ra, as illustrated in 
Fig. 41. A detailed behavior is as shown in Fig. 42. 

These instructions eliminate the need for data type 
conversions when byte data is handled in 16-bit SIMD, leading to a 
faster processing speed. 
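
The packing performed by Instruction stbh can be sketched in C as follows. The order in which the two bytes land in memory is shown only in Fig. 39, so the sketch assumes that the byte from bits 16~23 goes to the lower address. The inverse helper load_bytes_to_halfwords is a hypothetical name that zero-extends each byte; it is not claimed to match the exact behavior of Instruction ldbh.

```c
#include <stdint.h>

/* stbh (Ra), Rb: store the bytes held in bits 16..23 and bits 0..7 of Rb. */
void stbh(uint8_t *mem, uint32_t rb)
{
    mem[0] = (uint8_t)(rb >> 16); /* byte from the higher half word (assumed first) */
    mem[1] = (uint8_t)rb;         /* byte from the lower half word */
}

/* Inverse direction: place two bytes into the two half words, zero-extended. */
uint32_t load_bytes_to_halfwords(const uint8_t *mem)
{
    return ((uint32_t)mem[0] << 16) | mem[1];
}
```

With this pairing, pixel bytes can be widened to 16-bit lanes on load and narrowed on store without separate conversion instructions.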

[Instruction sethi] 

Instruction sethi is an instruction for storing an immediate 
value in the higher 16 bits of a register without changing the lower 
16 bits of the register. For example, when 

sethi Ra, I16 

is executed, the processor 1 stores a 16-bit immediate value (I16) in 
the higher 16 bits of the register Ra, as shown in Fig. 43. When this 
is done, there is no change in the lower 16 bits of the register Ra. A 
detailed behavior is as shown in Fig. 44. 

This instruction, when combined with Instruction "mov Rb, 
I16", makes it possible for a 32-bit immediate value to be set in a 
register. 
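
The two-step construction of a 32-bit immediate can be sketched in C. The function sethi follows the description above; load_imm32 is a hypothetical helper standing in for the "mov" then "sethi" sequence. How "mov" extends its 16-bit immediate is not stated here, and it does not matter for the result because sethi overwrites the higher 16 bits.

```c
#include <stdint.h>

/* sethi Ra, I16: replace the higher 16 bits of Ra, keep the lower 16 bits. */
uint32_t sethi(uint32_t ra, uint16_t imm16)
{
    return ((uint32_t)imm16 << 16) | (ra & 0xFFFFu);
}

/* Build a 32-bit constant: "mov" supplies the lower half, "sethi" the higher half. */
uint32_t load_imm32(uint32_t value)
{
    uint32_t r = value & 0xFFFFu;            /* models mov Ra, I16 (lower half) */
    return sethi(r, (uint16_t)(value >> 16)); /* models sethi Ra, I16 */
}
```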

[Instruction vaddhvc, vaddrhvc] 

Instruction vaddhvc is a SIMD instruction for switching 
the objects to be added, depending on the value of a vector condition 
flag. For example, when 

vaddhvc Rc, Ra, Rb 

is executed, the processor 1, using the operation unit 40 and others, 
adds the value held in the register Ra to the value held in the 
register Ra or Rb in the half word vector format, and stores the 
result into the register Rc, as shown in Fig. 45. When this is done, 
whether the value held in Ra or the value held in Rb is added depends 
on the value of the vector condition flag VC2. More specifically, 
when the vector condition flag VC2 = 1, the value held in the register 
Ra and the value held in the register Rb are added, and when 
VC2 = 0, the value held in the register Ra and the value held in the 
register Ra are added. A detailed behavior is as shown in Fig. 46. 

This instruction is effective when used for motion 
compensation in image processing. Since the value resulting from 
dividing the value held in the addition result register Rc by 2 is 
either Ra itself (the average of Ra and Ra) or the average value of 
Ra and Rb, there is an advantage that a single program can support 
half-pel motion compensation (motion compensation performed on 
a per-half-pixel basis) regardless of whether pixels are integer 
pixels or half pixels, as shown in Fig. 47. 

Meanwhile, Instruction vaddrhvc is equivalent to an 
instruction in which rounding is performed in addition to processing 
of the above-explained Instruction vaddhvc. For example, when 

vaddrhvc Rc, Ra, Rb 

is executed, the processor 1, using the arithmetic and logic/comparison 
operation unit 41 and others, adds the value held in the register Ra 
to the value held in the register Ra or Rb in the half word vector 
format and further adds 1 for rounding, and stores the result into 
the register Rc, as shown in Fig. 48. Other behavior is equivalent to 
that of Instruction vaddhvc. A detailed behavior is as shown in 
Fig. 49. 

This instruction is also effective when used for motion 
compensation in image processing. 
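
The behavior of Instructions vaddhvc and vaddrhvc can be modelled together in C. This is a sketch with assumptions: the single function vaddhvc_model with a round parameter is a hypothetical way to cover both instructions, the flag VC2 is passed as a plain integer, and each half word wraps modulo 2^16 since overflow handling is not described here.

```c
#include <stdint.h>

/*
 * Half word vector add of Ra with either Ra (VC2 = 0) or Rb (VC2 = 1).
 * round = 0 models vaddhvc; round = 1 models vaddrhvc, which adds 1 to
 * each half word for rounding.
 */
uint32_t vaddhvc_model(uint32_t ra, uint32_t rb, int vc2, int round)
{
    uint32_t src = vc2 ? rb : ra;
    uint32_t r = round ? 1u : 0u;
    uint16_t h = (uint16_t)((ra >> 16) + (src >> 16) + r);
    uint16_t l = (uint16_t)((ra & 0xFFFFu) + (src & 0xFFFFu) + r);
    return ((uint32_t)h << 16) | l;
}
```

Halving each half word of the result afterwards yields either Ra itself or the (optionally rounded) average of Ra and Rb, which is the property used for half-pel motion compensation.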

Note that a 1-bit shift right (processing to perform a 
division by 2) may be added to the functionality of each of the above 
instructions vaddhvc and vaddrhvc. Such functionality enables a 
processor to directly determine pixel values of integer pixels and 
half pixels. 

Moreover, it is also possible to define an instruction 
having the functionalities of both Instruction vaddhvc and Instruction 
vaddrhvc. An example of such an instruction is one which is capable of 
behaving either as Instruction vaddhvc or Instruction vaddrhvc 
depending on a value of a condition flag. Such an instruction allows a 
single program to perform processing regardless of whether 
rounding is performed or not. 

[Instruction vsgnh] 

Instruction vsgnh is a SIMD instruction for generating a value 
depending on the sign (positive/negative) of the value held in a 
register and whether the value held in the register is zero or not. For 
example, when 

vsgnh Ra, Rb 

is executed, the processor 1 stores one of the following values into 
the register Rb in half word vector format, as shown in Fig. 50: 
(i) 1 when the value held in the register Ra is positive, (ii) -1 when 
the value held in the register Ra is negative, and (iii) 0 when the 
value held in the register Ra is 0. A detailed behavior is as shown in 
Fig. 51. 

This instruction is effective when used for inverse 
quantization in image processing since 1 is outputted when a certain 
value is positive, -1 when negative, and 0 when 0. In the processor 
1, in particular, values on which SIMD operations are difficult to 
perform can be calculated at an increased speed. 

[Instruction valnvc1, valnvc2, valnvc3, valnvc4] 

Instruction valnvc1 is a SIMD instruction for byte-aligning 
data and extracting different byte data depending on a vector 
condition flag. For example, when 

valnvc1 Rc, Ra, Rb 

the processor 1 performs byte-alignment by shifting a bit string 
resulting from concatenating the registers Ra and Rb according to a 
value indicated by Bit ALN[1:0] of the condition flag register (CFR) 
32, and stores four pieces of byte data which have been extracted 
depending on a value of the vector condition flag VC0, as shown in 
Fig. 52. More specifically, the processor 1 extracts four pieces of 
byte data "a, a, b, and b" from byte-aligned data and stores them in 
the register Rc when the vector condition flag VC0=0, while 
extracting four pieces of byte data "a, b, b, and c" from byte-aligned 
data and storing them in the register Rc when the vector condition 
flag VC0=1. A detailed behavior is as shown in Fig. 53. 

This instruction is effective when used for motion 
compensation in image processing. Since a value resulting from 
dividing the value held in the addition result register Rc by 2 on a 
per-half word vector basis equals "a" and "b", or (a+b)/2 and 
(b+c)/2, there is an advantage that a single program can support 
half-pel motion compensation (motion compensation performed on 
a per-half-pixel basis) regardless of whether pixels are integer 
pixels or half pixels, as shown in Fig. 47. 
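
The extraction and the subsequent halving can be sketched in Python as follows. This is an illustrative model only, under assumptions that are not stated in the text: registers are represented as 4-byte sequences, ALN[1:0] is taken as a byte offset into the concatenation of Ra and Rb, "a", "b" and "c" are taken to be the first three bytes of the aligned window, and the helper `pair_average` stands in for the later addition and division by 2. The exact positions are those given in Figs. 52 and 53.

```python
def valnvc1(ra: bytes, rb: bytes, aln: int, vc0: int) -> list:
    """Byte-align Ra:Rb and pick four bytes depending on VC0 (assumed layout).

    ra, rb : 4 bytes each; aln : byte offset 0..3; vc0 : 0 or 1.
    """
    aligned = (ra + rb)[aln:aln + 4]       # byte-aligned 4-byte window
    a, b, c = aligned[0], aligned[1], aligned[2]
    return [a, b, b, c] if vc0 else [a, a, b, b]


def pair_average(rc: list) -> list:
    """Hypothetical follow-up step: add each byte pair and divide by 2."""
    return [(rc[0] + rc[1]) // 2, (rc[2] + rc[3]) // 2]


ra, rb = bytes([10, 20, 30, 40]), bytes([50, 60, 70, 80])
print(pair_average(valnvc1(ra, rb, aln=1, vc0=0)))   # integer pixels
print(pair_average(valnvc1(ra, rb, aln=1, vc0=1)))   # half pixels
```

The same two-step sequence yields "a" and "b" when VC0=0 and the averages of neighbouring bytes when VC0=1, so only the flag differs between the integer pixel and the half pixel cases.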

Note that basic behavior of each of Instructions valnvc2, 
valnvc3, and valnvc4 is the same as that of the above-explained 
Instruction valnvc1, other than the positions in the byte-aligned 
data from which pieces of byte data are extracted, as shown in 
Fig. 52. A detailed behavior of the respective instructions is as 
shown in Figs. 54, 55 and 56. Thus, these instructions are also 
effective when used for motion compensation in image processing. 

Also note that the present invention is not limited to byte as 
a unit of alignment, and therefore that half word and half byte may 
also serve as a unit of alignment. 
[Instruction addarvw] 

Instruction addarvw is an instruction for adding two values 
and further adding 1 when one of such values is positive. For 
example, when 

addarvw Rc, Rb, Ra 

the processor 1, using the arithmetic and logic/comparison 
operation unit 41 and others, adds the value held in the register Ra 
and the value held in the register Rb, as shown in Fig. 57. When this 
is done, the processor 1 further adds 1 when the value held in the 
register Ra is positive. A detailed behavior is as shown in Fig. 58. 

This instruction is effective when used for "rounding of an 
absolute value (away from zero)". As shown in Fig. 59, a value to be 
rounded is stored in the register Ra, and a value resulting from filling, 
with 1, a bit corresponding to one lower than the bit to be rounded 
shall be stored in the register Rb. When this instruction is executed 
after this, a result generated by rounding the absolute value of the 
value held in the register Ra (here, the most significant bit is a sign 
bit, and therefore the value held in Ra is fixed point data which has 
a point between the second bit and the third bit from the most 
significant bit) is to be stored in the register Rc. In an example 
illustrated in Fig. 58, by masking bits other than the higher 2 bits of 
the register Ra, +1 is obtained for +0.5, and -1 is obtained for -0.5, 
and absolute value rounding is realized. Thus, this instruction is 
effective when used for rounding absolute values in image 
processing. 
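
The rounding away from zero described above can be checked with a small Python sketch. The following is a sketch under stated assumptions rather than the circuit of Fig. 58: values are modelled as unbounded Python integers with `FRAC_BITS` fractional bits (a hypothetical constant), Rb is assumed to hold 1 in every bit below the 0.5 bit, and the masking of the lower bits is modelled as an arithmetic shift right.

```python
FRAC_BITS = 4                     # assumed number of fractional bits
HALF = 1 << (FRAC_BITS - 1)       # weight of the bit to be rounded (0.5)
RB = HALF - 1                     # assumed Rb: ones in all bits below 0.5


def addarvw(ra: int, rb: int) -> int:
    """Add two values and add a further 1 when ra is positive."""
    return ra + rb + (1 if ra > 0 else 0)


def round_away_from_zero(ra: int) -> int:
    """Round a fixed-point value to an integer, halves away from zero."""
    total = addarvw(ra, RB)
    # Discarding the fractional bits plays the role of the masking step.
    return total >> FRAC_BITS


for value in (8, -8, 4, -4, 24, -24):   # +0.5, -0.5, +0.25, -0.25, +1.5, -1.5
    print(value / 16, "->", round_away_from_zero(value))
```

Under these assumptions +0.5 becomes +1 and -0.5 becomes -1, which agrees with the example given in the text, while +0.25 and -0.25 both become 0.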

[Instruction movp] 

Instruction movp is an instruction for moving values held in 
arbitrary two registers to two consecutive registers. For example, 
when 

movp Rc: Rc+1, Ra, Rb 

the processor 1, using the I/F unit 50 and others, moves the value 
held in the register Ra to the register Rc, and moves the value held 
in the register Rb to the register Rc+1, as shown in Fig. 60. A 
detailed behavior is as shown in Fig. 61. 

Since values held in two independent registers are moved in 
one cycle under this instruction, an effect of reducing the number of 
cycles in a loop can be achieved. Also, this instruction, which does 
not involve register renaming (destruction of a register value), is 
effective when data is moved between loop generations (iterations). 

Note that move ("mov") is not the only applicable type of operation, 
and therefore unary operations (e.g. "neg") and binary operations 
("add") are also in the scope of the present invention. For example, 
regarding an add instruction in which arbitrary two registers (R0 and 
R6) and two consecutive registers (R2 and R3) are specified, two 
add operations, i.e. "R0+R2→R2" and "R6+R3→R3", are performed 
in a single instruction (in one cycle). 
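
The paired move and the paired add mentioned above can be modelled on a simple register file as follows. This is a sketch only: the register file is a Python list, `addp` is a hypothetical name for the paired add, and it is assumed that both source values are read before either destination is written.

```python
def movp(regs: list, rc: int, ra: int, rb: int) -> None:
    """Move regs[ra] to regs[rc] and regs[rb] to regs[rc + 1] in one step."""
    # The right-hand side is evaluated first, so both sources are read
    # before any destination is overwritten (assumed behavior).
    regs[rc], regs[rc + 1] = regs[ra], regs[rb]


def addp(regs: list, rc: int, ra: int, rb: int) -> None:
    """Hypothetical paired add: ra + rc -> rc and rb + (rc + 1) -> rc + 1."""
    regs[rc], regs[rc + 1] = regs[ra] + regs[rc], regs[rb] + regs[rc + 1]


regs = [10, 0, 1, 2, 0, 0, 20, 0]
addp(regs, 2, 0, 6)        # R0+R2 -> R2 and R6+R3 -> R3
print(regs[2], regs[3])
```
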
[Instruction jloop, settar] 
Instruction jloop is an instruction for performing branches 
and setting condition flags (predicates, here) in a loop. For 
example, when 

jloop C6, Cm, TAR, Ra 

the processor 1 behaves as follows, using the address management 
unit 10b and others: (i) sets 1 to the condition flag Cm; (ii) sets 0 
to the condition flag C6 when the value held in the register Ra is 
smaller than 0; (iii) adds -1 to the value held in the register Ra and 
stores the result into the register Ra; and (iv) branches to an 
address specified by the branch register (TAR) 30d. When not filled 
with a branch instruction, the jump buffer 10f (branch instruction 
buffer) is filled with a branch target instruction. A detailed 
behavior is as shown in Fig. 62. 

Meanwhile, Instruction settar is an instruction for storing a 
branch target address in the branch register (TAR) 30d, and setting 
condition flags (predicates, here). For example, when 

settar C6, Cm, D9 

the processor 1 behaves as follows, using the address management 
unit 10b and others: (i) stores an address resulting from adding the 
value held in the program counter (PC) 33 and a displacement value 
(D9) into the branch register (TAR) 30d; (ii) fetches the instruction 
corresponding to such address and stores it in the jump buffer 10f 
(branch instruction buffer); and (iii) sets the condition flag C6 to 1 
and the condition flag Cm to 0. A detailed behavior is as shown in 
Fig. 63. 
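
The state changes listed for the two instructions can be summarized in the following Python sketch. It is a minimal model under assumptions: the machine state is a dictionary with hypothetical keys "PC", "TAR", "Ra", "C6" and "Cm", the jump buffer is not modelled, and the test on Ra is assumed to use the value before the decrement, following the order of steps (ii) and (iii).

```python
def settar(state: dict, d9: int) -> None:
    """Model of 'settar C6, Cm, D9' on a dictionary of machine state."""
    state["TAR"] = state["PC"] + d9     # (i) branch target = PC + displacement
    # (ii) the fetch into the jump buffer is not modelled here
    state["C6"], state["Cm"] = 1, 0     # (iii) initialise the predicates


def jloop(state: dict) -> int:
    """Model of 'jloop C6, Cm, TAR, Ra'; returns the branch target."""
    state["Cm"] = 1                     # (i)
    if state["Ra"] < 0:                 # (ii) tested before the decrement
        state["C6"] = 0
    state["Ra"] -= 1                    # (iii)
    return state["TAR"]                 # (iv) address to branch to


state = {"PC": 100, "TAR": 0, "Ra": 0, "C6": 0, "Cm": 1}
settar(state, 8)
print(jloop(state), state)
```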

These instructions jloop and settar, which are usually used in 
pairs, are effective when used for increasing a loop speed by means 
of PROLOG/EPILOG removal software pipelining. Note that 
software pipelining, which is a technique used by a compiler to 
increase a loop speed, allows efficient parallel execution of a 
plurality of instructions by converting a loop structure into a 
PROLOG portion, a KERNEL portion and an EPILOG portion, and by 
overlapping each iteration with the previous iteration and the 
following iteration regarding the KERNEL portion. 

"PROLOG/EPILOG removal" is intended to visually remove a 
PROLOG portion and an EPILOG portion by using the PROLOG portion 
and the EPILOG portion as conditional execution instructions to be 
performed according to predicates, as shown in Fig. 64. In 
PROLOG/EPILOG removal 2-stage software pipelining shown in 
Fig. 64, the condition flags C6 and C4 are illustrated as predicates for 
an EPILOG instruction (Stage 2) and a PROLOG instruction (Stage 1), 
respectively. 

The following gives an explanation for the significance of the 
functionality of moving flags (setting of the condition flag Cm) 
provided by the above Instructions jloop and settar, in comparison 
with ordinary instructions jloop and settar without such functionality. 

When Instruction jloop and Instruction settar according to the 
present embodiment are not included in an instruction set, i.e. when 
an instruction set includes only ordinary jloop and settar 
instructions, the condition flag Cm needs to be moved in the 
respective ordinary jloop and settar instructions in an independent 
manner. For this reason, the following problems occur: 

(1) There is an increase in the number of flag move 
instructions, which are unrelated to the original functionality of a 
loop execution, and the performance of a processor is degraded due 
to PROLOG/EPILOG removal software pipelining; 

(2) Dependency on data among flags grows stronger, and the 
performance of a processor is degraded due to data dependency 
among flags, locational limitations and the like; and 

(3) There arises the need for an inter-flag 
move instruction, which is not originally required to be included in 
an instruction set, and therefore there will be a scarcity of the bit 
field space of the instruction set. 

For example, when the ordinary jloop and settar instructions 
are used in a source program written in the C language shown in 
Fig. 65, a compiler generates a machine language program shown in 
Fig. 66 by means of PROLOG/EPILOG removal software pipelining. 
As indicated by the loop part in such machine language program 
(Label L00023~Instruction jloop), 3 cycles are involved in loop 
execution since an instruction for setting the condition flag C4 
(Instruction cmpeq) is required. Furthermore, two instructions are 
required for the setting and resetting of the condition flag C4, 
reducing the effect of PROLOG/EPILOG removal. 

In contrast, when Instruction jloop and Instruction settar 
according to the present embodiment are included in an instruction 
set, a compiler generates a machine language program shown in 
Fig. 67. As indicated by the loop part in such machine language 
program (Label L00023~Instruction jloop), the setting and resetting 
of the condition flag C4 are conducted under Instructions jloop and 
settar, respectively. This reduces the need for any special 
instructions, allowing loop execution to complete in 2 cycles. 

As is obvious from the above, Instruction "jloop C6, Cm, 
TAR, Ra" and Instruction "settar C6, Cm, D9" are effective for 
reducing the number of execution cycles in 2-stage PROLOG/EPILOG 
removal software pipelining. 

Note that the processor 1 supports instructions which are 
applicable not only to 2-stage software pipelining, but also to 
3-stage software pipelining: Instruction "jloop C6, C2: C4, TAR, 
Ra" and Instruction "settar C6, C2: C4, D9". These instructions 
"jloop C6, C2: C4, TAR, Ra" and "settar C6, C2: C4, D9" are 
equivalent to instructions in which the register Cm in the 
above-described 2-stage instructions "jloop C6, Cm, TAR, Ra" and 
"settar C6, Cm, D9" is extended to the registers C2, C3 and C4. 
To put it another way, when 

jloop C6, C2: C4, TAR, Ra 

the processor 1 behaves as follows, using the address management 
unit 10b and others: (i) sets the condition flag C4 to 0 when the 
value held in the register Ra is smaller than 0; (ii) moves the value 
of the condition flag C3 to the condition flag C2 and moves the value 
of the condition flag C4 to the condition flags C3 and C6; (iii) adds 
-1 to the register Ra and stores the result into the register Ra; and 
(iv) branches to an address specified by the branch register (TAR) 
30d. When not filled with a branch instruction, the jump buffer 10f 
(branch instruction buffer) is filled with a branch target instruction. 
A detailed behavior is as shown in Fig. 68. 
Also, when 

settar C6, C2: C4, D9 

the processor 1 behaves as follows, using the address management 
unit 10b and others: (i) stores an address resulting from adding the 
value held in the program counter (PC) 33 and a displacement value 
(D9) into the branch register (TAR) 30d; (ii) fetches the instruction 
corresponding to such address and stores it in the jump buffer 10f 
(branch instruction buffer); and (iii) sets the condition flags C4 and 
C6 to 1 and the condition flags C2 and C3 to 0. A detailed behavior 
is as shown in Fig. 69. 

Figs. 70A and 70B show the role of the condition flags in the 
above 3-stage instructions "jloop C6, C2: C4, TAR, Ra" and 
"settar C6, C2: C4, D9". As shown in Fig. 70A, in PROLOG/EPILOG 
removal 3-stage software pipelining, the condition flags C2, C3 and 
C4 are predicates intended for Stage 3, Stage 2 and Stage 1, 
respectively. Fig. 70B is a diagram showing how instruction 
execution proceeds when flags are moved in such a case. 
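
How the moving flags fill and drain a 3-stage pipeline can be illustrated with the following Python simulation. It is a sketch under several assumptions that the text does not confirm: all flag moves in jloop read the flag values from before the instruction, the loop is taken to end once C2, C3 and C4 are all 0, and the counter Ra is initialised to n - 2 for n iterations. The function name and the trace format are hypothetical; Figs. 68 to 70B define the actual behavior.

```python
def run_pipelined_loop(n: int) -> list:
    """Trace which (stage, iteration) pairs run in each pass of the kernel."""
    # settar: C4 and C6 are set to 1, C2 and C3 are set to 0.
    c2, c3, c4, c6 = 0, 0, 1, 1
    ra = n - 2                        # assumed initial counter value
    trace, cycle = [], 0
    while c2 or c3 or c4:             # assumed exit test: pipeline drained
        ran = []
        if c4:
            ran.append(("S1", cycle))       # Stage 1 of iteration `cycle`
        if c3:
            ran.append(("S2", cycle - 1))   # Stage 2 of the previous one
        if c2:
            ran.append(("S3", cycle - 2))   # Stage 3 of the one before
        trace.append(ran)
        # jloop: clear C4 when Ra < 0, shift C3 -> C2 and C4 -> C3, C6.
        new_c4 = 0 if ra < 0 else c4
        c2, c3, c6, c4 = c3, c4, c4, new_c4
        ra -= 1
        cycle += 1
    return trace


for step in run_pipelined_loop(3):
    print(step)
```

In this model the first passes execute only the early stages (the role of a PROLOG), the middle passes execute all three stages, and the last passes execute only the late stages (the role of an EPILOG), although the program contains the KERNEL alone.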

Figs. 71~73 show program examples illustrating the 
significance of moving flags in the above instructions "jloop C6, 
C2: C4, TAR, Ra" and "settar C6, C2: C4, D9". Fig. 71 shows an 
example of a source program, Fig. 72 shows an example of a machine 
language program created by using ordinary instructions jloop and 
settar without the functionality of moving flags, and Fig. 73 shows an 
example of a machine language program created by using 
Instruction jloop and Instruction settar according to the present 
embodiment that have the functionality of moving flags. As is 
obvious from the comparison between Fig. 72 and Fig. 73, the use of 
Instruction jloop and Instruction settar according to the present 
embodiment that have the functionality of moving flags reduces the 
number of instructions by five as well as the number of cycles 
taken by one execution of a loop by one. 

Note that the above description also applies to software pipelining 
involving four or more stages, and the number of condition flags 
serving as predicates simply needs to be increased in such a case. 

In addition to the characteristic instructions described above, 
the processor 1 is also capable of executing the following 
characteristic instructions which are not shown in Figs. 21~36. 
[Instruction vsada] 

Instruction vsada is a SIMD instruction for determining a sum 
of absolute value differences. For example, when 

vsada Rc, Ra, Rb, Rx 

the processor 1, using the arithmetic and logic/comparison 
operation unit 41 and others, performs SIMD operations for 
determining differences between the values held in the register Ra 
and the values held in the register Rb on a byte-by-byte basis 
(determines the difference between the respective four byte pairs), 
as shown in Fig. 74, determines the absolute value of each of the four 
results so as to add them, adds the value held in the register Rx to 
this addition result, and stores the final result into the register Rc. 
A detailed behavior is as shown in Fig. 75A. 

Note that the processor 1 is also capable of executing an 
instruction which does not include the last operand (Rx) in the 
format of the above Instruction vsada. For example, when 

vsada Rc, Ra, Rb 

the processor 1, using the arithmetic and logic/comparison 
operation unit 41 and others, performs SIMD operations for 
determining differences between the values held in the register Ra 
and the values held in the register Rb on a byte-by-byte basis 
(determines the difference between the respective four byte pairs), 
determines the absolute value of each of the four results so as to add 
them, and stores the result into the register Rc. A detailed 
behavior is as shown in Fig. 75B. 
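
Both forms of Instruction vsada can be sketched with one Python function. This is an illustrative model which assumes that a register is a 32-bit value holding four unsigned bytes and which ignores any overflow of the result; the default `rx=0` stands in for the form without the last operand.

```python
def vsada(ra: int, rb: int, rx: int = 0) -> int:
    """Sum of absolute differences of the four byte pairs, plus rx."""
    total = rx
    for shift in (24, 16, 8, 0):
        a = (ra >> shift) & 0xFF      # byte of Ra, treated as unsigned
        b = (rb >> shift) & 0xFF      # byte of Rb at the same position
        total += abs(a - b)
    return total


# Byte differences are 3, 1, 1 and 3.
print(vsada(0x01020304, 0x04030201))
# The same sum accumulated onto a previous partial result in Rx.
print(vsada(0x01020304, 0x04030201, 100))
```

Passing the previous result as Rx allows the sum to be accumulated over successive groups of four pixels, for example across the rows of a block.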

These instructions vsada are instructions resulting from 
compounding Instruction vasubb and Instruction vabssumb. 
Instruction vasubb is a SIMD instruction for performing subtractions 
on four pairs of SIMD data on a byte-by-byte basis, and storing the 
resulting four signs in the condition flag register. Instruction 
vabssumb, on the other hand, is a SIMD instruction for adding the 
absolute values of four pairs of SIMD data on a byte-by-byte basis 
according to the condition flag register, and adding this addition 
result to another 4-byte data. 

Thus, Instruction vsada makes it possible for a sum of 
absolute value differences to be determined in one cycle and 
therefore makes operations faster, as compared with 
the case where Instruction vasubb and Instruction vabssumb are 
used in succession. Instruction vsada is effective when used for 
summing up absolute value differences in motion prediction as part 
of image processing. 

Note that data does not have to be in byte, and therefore half 
word, half byte and other units are also in the scope of the present 
invention. 

[Instruction satss, satsu] 

Instruction satss is an instruction for converting a signed 
value into a saturated signed value at an arbitrary position (digit). 
For example, when 

satss Rc, Ra, Rb 

the processor 1, using the saturation block (SAT) 47a and others, 
stores, into the register Rc, a saturated value (one's complement of 
the register Rb) specified by the register Rb when the value held in 
the register Ra is larger than such saturated value, and stores the 
value held in the register Ra into the register Rc when the value held 
in the register Ra is equal to or smaller than the saturated value, as 
illustrated in Fig. 76. A detailed behavior is as shown in Fig. 77A. 

Meanwhile, Instruction satsu is an instruction for converting 
an unsigned value into a saturated signed value at an arbitrary 
position (digit). For example, when 

satsu Rc, Ra, Rb 

the processor 1, using the saturation block (SAT) 47a and others, 
stores a saturated value specified by the register Rb into the register 
Rc when the value held in the register Ra is larger than such 
saturated value, and stores the value held in the register Ra into the 
register Rc when the value held in the register Ra is equal to or 
smaller than the saturated value. A detailed behavior is as shown 
in Fig. 77B. 

The above Instruction satss and Instruction satsu allow 
saturation processing to be performed at an arbitrary position. 
This facilitates programming since there is no need to fix the 
position where saturation is performed to a specific position at the 
time of assembler programming. 
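
Saturation at a limit taken from a register can be sketched in Python as follows. The upper limit follows the description above. The lower limit used in `satss`, namely the one's complement of Rb, is an assumption made here to illustrate a signed range and is not confirmed by the text; Figs. 77A and 77B give the actual definitions.

```python
def satss(ra: int, rb: int) -> int:
    """Clamp a signed value to the limit given by rb."""
    if ra > rb:
        return rb                 # upper limit, as described in the text
    if ra < ~rb:
        return ~rb                # assumed lower limit: one's complement of rb
    return ra


def satsu(ra: int, rb: int) -> int:
    """Clamp a value to the upper limit given by rb."""
    return rb if ra > rb else ra


# With rb = 0xFF the assumed signed range is -256..255.
print(satss(300, 0xFF), satss(-300, 0xFF), satss(5, 0xFF))
print(satsu(300, 0xFF), satsu(7, 0xFF))
```

Because the limit is an ordinary operand, the same instruction can saturate at 8 bits, 12 bits or any other position simply by loading a different value into Rb.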
[Instruction bytesel] 

Instruction bytesel is an instruction for selecting one of the 
values held in two registers on a byte-by-byte basis. For example, 
when 

bytesel Rc, Ra, Rb, Rx 

the processor 1, using the operation unit 40 and others, stores one 
of eight pieces of byte data held in the register Ra and the register 
Rb into the register Rc, on the basis of a value indicated by the 
register Rx, as illustrated in Fig. 78. This behavior is performed on 
four pieces of bytes in the register Rc in parallel. A detailed 
behavior is shown in Fig. 79A, and a relationship between the 
register Rx and byte data to be selected is shown in Fig. 79B. 

Note that the processor 1 behaves in an equivalent manner 
also for Instruction bytesel in the following format: when 

bytesel Rc, Ra, Rb, I12 

the processor 1, using the operation unit 40 and others, stores one 
of eight pieces of byte data held in the register Ra and the register 
Rb into the register Rc, on the basis of a 12-bit immediate value. 
This behavior is performed on four pieces of bytes in the register Rc 
in parallel. A detailed behavior is shown in Fig. 79C, and a 
relationship between an immediate value I12 and byte data to be 
selected is shown in Fig. 79D. 

Instruction bytesel allows byte data to be stored at an 
arbitrary position in a register, and therefore makes repetitions of 
data reshuffling faster. Moreover, this instruction has an effect of 
increasing the flexibility of SIMD operations. 
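
The byte selection can be illustrated with the following Python sketch. The encoding is an assumption: the 12-bit value is split into four 3-bit selectors, one per destination byte from the highest byte down, and selector values 0 to 3 are taken to address the bytes of Ra from the highest byte down while 4 to 7 address the bytes of Rb. The actual relationship is the one shown in Figs. 79B and 79D.

```python
def bytesel(ra: int, rb: int, sel: int) -> int:
    """Build Rc from four bytes chosen among the eight bytes of Ra and Rb.

    sel : 12-bit value, one 3-bit selector per destination byte (assumed).
    """
    src = [(ra >> s) & 0xFF for s in (24, 16, 8, 0)] + \
          [(rb >> s) & 0xFF for s in (24, 16, 8, 0)]
    rc = 0
    for i in range(4):                        # destination bytes, high to low
        index = (sel >> (9 - 3 * i)) & 0x7    # 3-bit selector for byte i
        rc = (rc << 8) | src[index]
    return rc


ra, rb = 0x11223344, 0x55667788
print(hex(bytesel(ra, rb, 0o3210)))   # reverse the bytes of Ra
print(hex(bytesel(ra, rb, 0o0415)))   # interleave bytes of Ra and Rb
```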

Note that whether the above byte data is to be stored or not 
in each of Rc[31:24], Rc[23:16], Rc[15:8], and Rc[7:0] may be 
specifiable in Instruction "bytesel Rc, Ra, Rb, Rx" explained 
above, utilizing an empty digit or the like in the register Rx. This 
allows a byte-by-byte basis selection of whether the value held in 
the register Rc is to be updated or not. 

Note that data does not have to be in byte, and therefore half 
word, half byte and other units are also in the scope of the present 
invention. 

[Instructions for extending results of SIMD operations] 

The processor 1 is also capable of executing SIMD 
operation-related complementary processing, in addition to the 
above-explained instructions. 

For example, the processor 1, when a certain instruction is 
issued, performs complementary processing for extending a part of 
results of SIMD operations (sign extension or zero extension), as 
illustrated in Figs. 80A and 80B, which show the processor 1 
performing SIMD operations on data at the same positions in 
respective registers (to be referred to also as "straight positions" 
hereinafter) or on data at diagonally crossed positions, on a per-half 
word basis. Fig. 80A illustrates processing for extending the lower 
half word of a required result to a word, and Fig. 80B illustrates 
processing for extending the higher half word of a required result to 
a word. 

Note that Instruction vaddh is an example instruction for 
performing SIMD operations on data at straight positions on a 
per-half word basis, while Instruction vxaddh is an example 
instruction for performing SIMD operations on data at diagonally 
crossed positions on a per-half word basis. 

Also note that the processor 1, when a certain instruction is 
issued, performs complementary processing for extending all 
results of SIMD operations, as illustrated in Fig. 81. Fig. 81 
illustrates the processor 1 performing SIMD operations on pieces of 
data stored at straight positions or diagonally crossed positions in 
two registers on a per-half word basis, as well as extending each of 
resulting two half words to a word. 

Such an instruction for extending results of SIMD operations as 
above is effective for making all data the same size by performing 
sign extension or zero extension after performing the SIMD 
operations, enabling SIMD operations and extension processing to 
be performed in one cycle. 
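
An addition on half words followed by extension of one result to a word can be sketched in Python as follows. The function name `vaddh_extend` and its keyword parameters are hypothetical, and the model assumes 32-bit registers holding two 16-bit half words with wraparound addition in each lane.

```python
def to_signed16(h: int) -> int:
    """Reinterpret a 16-bit pattern as a signed value."""
    return h - 0x10000 if h & 0x8000 else h


def vaddh_extend(ra, rb, *, crossed=False, high=False, signed=True):
    """Add half words, then extend one result half word to a 32-bit word."""
    a_hi, a_lo = (ra >> 16) & 0xFFFF, ra & 0xFFFF
    b_hi, b_lo = (rb >> 16) & 0xFFFF, rb & 0xFFFF
    if crossed:                       # diagonally crossed positions
        b_hi, b_lo = b_lo, b_hi
    res_hi = (a_hi + b_hi) & 0xFFFF
    res_lo = (a_lo + b_lo) & 0xFFFF
    half = res_hi if high else res_lo
    if signed:
        half = to_signed16(half)      # sign extension
    return half & 0xFFFFFFFF          # zero extension leaves `half` as is


ra, rb = 0x0001FFFF, 0x00020000
print(hex(vaddh_extend(ra, rb)))                  # lower result, sign-extended
print(hex(vaddh_extend(ra, rb, signed=False)))    # lower result, zero-extended
print(hex(vaddh_extend(ra, rb, high=True)))       # higher result
```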

Furthermore, the processor 1 is also capable of executing 
SIMD operations specified by condition flags and the like, as SIMD 
operation-related complementary instructions. For example, the 
processor 1, when condition flags specify that the first and the 
second operations should be "addition" and "subtraction" 
respectively, performs additions and subtractions on each of data 
pairs in two registers at straight positions or diagonally crossed 
positions on a per-half word basis, as illustrated in Fig. 82. 

For example, when the condition flags C0 and C1 are "1 and 0", 
the processor 1 behaves as follows, using the arithmetic and 
logic/comparison operation unit 41 and others: 

(1) adds the higher half word of the register Ra with the 
higher half word of the register Rb, and stores this addition result 
into the higher half word of the register Rc; and 

(2) subtracts the lower half word of the register Rb from the 
lower half word of the register Ra, and stores this subtraction result 
into the lower half word of the register Rc. 

Such an instruction in which types of SIMD operations are 
specifiable is effective for processing in which types of operations to 
be performed are not fixed, and therefore in which an operation shall 
be determined depending on a result of other processing. 
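
The selection of the operation by condition flags can be sketched in Python as follows. The encoding is an assumption generalized from the single example in the text: C0 selects the operation on the higher half word, C1 selects the operation on the lower half word, and a flag value of 1 means addition while 0 means subtraction. The function name `vaddsubh` is hypothetical.

```python
def vaddsubh(ra: int, rb: int, c0: int, c1: int) -> int:
    """Per-half word add or subtract, chosen by c0 (high) and c1 (low)."""
    a_hi, a_lo = (ra >> 16) & 0xFFFF, ra & 0xFFFF
    b_hi, b_lo = (rb >> 16) & 0xFFFF, rb & 0xFFFF
    hi = (a_hi + b_hi) if c0 else (a_hi - b_hi)   # 1: add, 0: subtract
    lo = (a_lo + b_lo) if c1 else (a_lo - b_lo)
    # Each result wraps around within its own 16-bit lane.
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


# C0 = 1 and C1 = 0: add the higher half words, subtract the lower ones.
print(hex(vaddsubh(0x00050009, 0x00030004, 1, 0)))
```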

Note that the present invention is applicable to a case where the 
register Rb is not used in the above operations (1) and (2). For 
example, the processor 1 may: 

(1) add the higher half word of the register Ra with the lower 
half word of the register Ra, and store this addition result into the 
higher half word of the register Rc; and 

(2) subtract the lower half word of the register Ra from the 
higher half word of the register Ra, and store this subtraction result 
into the lower half word of the register Rc. 
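
This single-register variant produces the sum and the difference of the two half words of Ra at once, which can be sketched in Python as follows. The function name `add_sub_within` is hypothetical, and 16-bit wraparound in each lane is assumed.

```python
def add_sub_within(ra: int) -> int:
    """Return (hi + lo) in the higher half word and (hi - lo) in the lower."""
    hi, lo = (ra >> 16) & 0xFFFF, ra & 0xFFFF
    return (((hi + lo) & 0xFFFF) << 16) | ((hi - lo) & 0xFFFF)


# Half words 7 and 3 give the sum 10 and the difference 4.
print(hex(add_sub_within(0x00070003)))
```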